Lets assume we have 2 tasks as airflow operators: task_1 and task_2. We want to skip task_1 on Mondays and run both tasks on the rest of the days. In addition we also want to re-run both tasks on monday at a later time.
Example of a monday run:
Run 1: task_2
Run 2: task_1 -> task_2
Example of rest of week runs:
Run 1: task_1 -> task_2
So on mondays first task gets skipped and in a later second run both tasks run, whereas on all other days both tasks run as usual.
We already looked at the BranchDayOfWeekOperator() (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/weekday.html) but this does not solve the second re-run issue on mondays.
Can you help us with the re-run issue ?
Thanks in advance.
If you want to use BranchDayOfWeekOperator you can skip task_1 on Monday, and execute it on other days, and set the branch and the task_1 as upstream for task_2 and set its trigger rule to ONE_SUCCESS in order to execute it for the two branchs:
from airflow import DAG
from datetime import datetime
from airflow.utils.trigger_rule import TriggerRule
from airflow.operators.empty import EmptyOperator
from airflow.operators.weekday import BranchDayOfWeekOperator, WeekDay
with DAG(
'dag',
schedule_interval="#daily",
start_date=datetime(2022, 12, 1),
) as dag:
branch = BranchDayOfWeekOperator(
task_id="skip_monday",
follow_task_ids_if_true="task_2",
follow_task_ids_if_false="task_1",
week_day={WeekDay.MONDAY},
)
task_1 = EmptyOperator(task_id="task_1")
task_2 = EmptyOperator(task_id="task_2", trigger_rule=TriggerRule.ONE_SUCCESS)
branch >> [task_1, task_2]
task_1 >> task_2
But you need to know that this operator use the start date by default, so if you want to skip Monday runs and not the runs executed on Monday (backfill or cleared runs), you need to add the argument use_task_logical_date=True.
In addition we also want to re-run both tasks on Monday at a later time.
if you want to run it later, you can run it manually using Airlfow UI, and activating Ignore All Deps
Related
I want to create a DAG to run in Google Cloud Composer. The workflow contains a ParallelFor and I don´t know how to model that.
The workflow looks something like this:
task1 >> task2 >> task3 >> task4
where task2 splits data into x arrays. Now, I want to run task3 in parallel for these x arrays. Task3 outputs something and task4 combines the outputs.
(you can find a picture of the workflow here: https://github.com/Apollo-Workflows/Sentiment-Analysis)
For now, I have two possible ideas how it could work:
There is an easy syntax for it (like >> for sequential execution). But I did not found such syntax
Working with sub-DAGs. My idea was to append task2 so that it creates x subDAGs (one for each array). The subDAG is basically task3. After all subDAGs are finished, their output is forwarded to task4. Is that possible? If yes, how do I do it?
I have found a solution for my problem. It follows my first possible solution idea. Just use the mechanics from this link:
Airflow rerun a single task multiple times on success
I believe that the post you mentioned as a possible idea, points on the direction on how to run a task after the previous one has ended.
To run dags in parallel you should follow a structure similar to this
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
dag = DAG("dag_paralel", description="Starting tutorial", schedule_interval=None,
start_date=datetime(2019, 1, 1),
catchup=False)
task_1 = BashOperator(task_id='task_1', bash_command='echo "This is task 1!"',dag=dag)
task_2 = BashOperator(task_id='task_2', bash_command='echo "This is task 2!"',dag=dag)
task_list = []
max_attempt = 3
for attempt in range(max_attempt):
data_pull = BashOperator(
task_id='task_3_{}'.format(attempt),
bash_command='echo "This is task - 3_{}!"'.format(attempt),
dag=dag
)
task_list.append(data_pull)
data_validation = BashOperator(task_id='task_final', bash_command='echo "We are at the end"',dag=dag)
task_1 >> task_2 >> task_list
task_list >> data_validation
This is the DAG structure obtained by this method
I have to schedule a DAG that should run on 15th of every month. However, if 15th falls on a Sunday/Saturday then the DAG should skip weekends and run on coming Monday.
For example, May 15 2021 falls on a Saturday. So, instead of running on 15th of May, the DAG should run on 17th, which is Monday.
Can you please help to schedule it in airflow?
Thanks in advance!
The logic of scheduling is limited by what you can do with single cron expression. So if you can't say it in cron expression you can't provide such scheduling in Airflow. For that reason there is an open airflow improvement proposal AIP-39 Richer scheduler_interval to give more scheduling capabilities.
That said, you can still get the desired functionality by writing some code.
You can set your dag to start on the 15th of each month and then place a sensor that verify that the date is Mon-Fri (if not it will wait):
from airflow.sensors.weekday import DayOfWeekSensor
dag = DAG(
dag_id='work',
schedule_interval='0 0 15 * *',
default_args=default_args,
description='Schedule a Job on 15 of each month',
)
weekend_check = DayOfWeekSensor(
task_id='weekday_check_task',
week_day={'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'},
mode='reschedule',
dag=dag)
op_1 = YourOperator(task_id='op1_task',dag=dag)
weekend_check >> op_1
Note: If you are running airflow<2.0.0 you will need to change the import to:
from airflow.contrib.sensors.weekday_sensor import DayOfWeekSensor
The answer posted by Elad works pretty well. I came up with another solution that works as well.
I scheduled the job to run on 15,16 and 17 of the month. However, I added a condition so that the job runs on the 15th if its a weekday. The job runs on 16th and 17th if its a Monday.
To achieve that, I added a BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
def _conditinal_task_initiator(**kwargs):
execution_date=kwargs['execution_date']
if int(datetime.strftime(execution_date,'%d'))==15 and (execution_date.weekday()<5):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==16 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
elif int(datetime.strftime(execution_date,'%d'))==17 and (execution_date.weekday()==0):
return 'dummy_task_run_cmo_longit'
else:
return 'dummy_task_skip_cmo_longit'
with DAG(dag_id='NXS_FM_LOAD_CMO_CHOICE_LONGIT',default_args = default_args, schedule_interval = "0 8 15-17 * *", catchup=False) as dag:
conditinal_task_initiator=BranchPythonOperator(
task_id='cond_task_check_day',
provide_context=True,
python_callable=_conditinal_task_initiator,
do_xcom_push=False)
dummy_task_run_cmo_longit=DummyOperator(
task_id='dummy_task_run_cmo_longit')
dummy_task_skip_cmo_longit=DummyOperator(
task_id='dummy_task_skip_cmo_longit')
conditinal_task_initiator >> [dummy_task_run_cmo_longit,dummy_task_skip_cmo_longit]
dummy_task_run_cmo_longit >> <main tasks for execution>
Using this, the job'll run on 15,16 and 17 of every month. However, it'll run the functional tasks only once every month.
I have a requirement that I want to schedule an airflow job every alternate Friday. However, the problem is I am not able to figure out how to write a schedule for this.
I don't want to have multiple jobs for this.
I tried this
'0 0 1-7,15-21 * 5
However it's not working it's running from 1 to 7 and 15 to 21 everyday.
from shubham's answer I realize that we can have a PythonOperator which can skip the task for us. An I tried to implement the solution. However doesn't seem to work.
As testing this on 2 week period would be too difficult. This is what I did.
I schedule the DAG to run every 5 mins
However, I am writing python operator the skip althernate task (pretty similar to what I am trying to do, alternate friday).
DAG:
args = {
'owner': 'Gaurang Shah',
'retries': 0,
'start_date':airflow.utils.dates.days_ago(1),
}
dag = DAG(
dag_id='test_dag',
default_args=args,
catchup=False,
schedule_interval='*/5 * * * *',
max_active_runs=1
)
dummy_op = DummyOperator(task_id='dummy', dag=dag)
def _check_date(execution_date, **context):
min_date = datetime.now() - relativedelta(minutes=10)
print(context)
print(context.get("prev_execution_date"))
print(execution_date)
print(datetime.now())
print(min_date)
if execution_date > min_date:
raise AirflowSkipException(f"No data available on this execution_date ({execution_date}).")
check_date = PythonOperator(
task_id="check_if_min_date",
python_callable=_check_date,
provide_context=True,
dag=dag,
)
I doubt that a single crontab expression can solve this
Using airflow's tricks, solution is much more straightforward:
schedule your DAG every Friday 0 0 * * FRI and
on alternate Fridays (based on your business logic), skip the DAG by raising AirflowSkipException
Here you'll have to let your DAG begin with a dedicated skip_decider task that will let your DAG run / skip every alternate Friday by
conditionally raising AirflowSkipException (to skip the DAG)
not doing anything to let the DAG run
You can also leverage
ShortCircuitOperator
BranchPythonOperator
but IMO, AirflowSkipException is the cleanest solution
Reference: How to define a DAG that scheduler a monthly job together with a daily job?
Depending on your implementation you can use the hash. Worked in my airflow schedules using version 1.10:
Hash (#)
'#' is allowed for the day-of-week field, and must be followed by a number between one and five. It allows specifying constructs such as "the second Friday" of a given month.[19] For example, entering "5#3" in the day-of-week field corresponds to the third Friday of every month. Reference
you can use timedelta as mentioned below, combine it with start_date to schedule your job bi_weekly.
dag = DAG(
dag_id='test_dag',
default_args=args,
catchup=False,
start_date=datetime(2021, 3, 26),
schedule_interval=timedelta(days=14),
max_active_runs=1
)
I have scheduled the execution of a DAG to run daily.
It works perfectly for one day.
However each day I would like to re-execute not only for the current day {{ ds }} but also for the previous n days (let's say n = 7).
For example, in the next execution scheduled to run on "2018-01-30" I would like Airflow not only to run the DAG using as execution date "2018-01-30", but also to re-run the DAGs for all the previous days from "2018-01-23" to "2018-01-30".
Is there an easy way to "invalidate" the previous execution so that a backfill is run automatically?
You can generate dynamically tasks in a loop and pass the offset to your operator.
Here is an example with the Python one.
import airflow
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import timedelta
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
'schedule_interval': '0 10 * * *'
}
def check_trigger(execution_date, day_offset, **kwargs):
target_date = execution_date - timedelta(days=day_offset)
# use target_date
for day_offset in xrange(1, 8):
PythonOperator(
task_id='task_offset_' + i,
python_callable=check_trigger,
provide_context=True,
dag=dag,
op_kwargs={'day_offset' : day_offset}
)
Have you considered having the dag that runs once a day just run your task for the last 7 days? I imagine you’ll just have 7 tasks that each spawn a SubDAG with a different day offset from your execution date.
I think that will make debugging easier and history cleaner. I believe trying to backfill already executed tasks will involve deleting task instances or setting their states all to NONE. Then you’ll still have to trigger a backfill on those dag runs. It’ll be harder to track when things fail and just seems a bit messier.
My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc), and a task is launched for every element in the input list. The goal is to make use of Airflow's retrying and other logic, instead of reimplementing it.
So, ideally, my DAG should look something like this:
The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate.
This is my code:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2015, 6, 1)
}
dag = DAG('dynamic_dag_generator', schedule_interval=None, default_args=default_args)
foo_operator = BashOperator(
task_id='foo',
bash_command="echo '%s'" % json.dumps(range(0, random.randint(40,60))),
xcom_push=True,
dag=dag)
def gen_nodes(**kwargs):
ti = kwargs['ti']
workers = json.loads(ti.xcom_pull(task_ids='foo'))
for wid in workers:
print("Iterating worker %s" % wid)
op = PythonOperator(
task_id='test_op_%s' % wid,
python_callable=lambda: print("Dynamic task!"),
dag=dag
)
op.set_downstream(bar_operator)
op.set_upstream(dummy_op)
gen_subdag_node_op = PythonOperator(
task_id='gen_subdag_nodes',
python_callable=gen_nodes,
provide_context=True,
dag=dag
)
gen_subdag_node_op.set_upstream(foo_operator)
dummy_op = DummyOperator(
task_id='dummy',
dag=dag
)
dummy_op.set_upstream(gen_subdag_node_op)
bar_operator = DummyOperator(
task_id='bar',
dag=dag)
bar_operator.set_upstream(dummy_op)
In the logs, I can see that gen_nodes is executed correctly (i.e. Iterating worker 5, etc). However, the new tasks are not scheduled and there is no evidence that they were executed.
I found related code samples online, such as this, but could not make it work. Am I missing something?
Alternatively, is there a more appropriate approach to this problem (isolating units of work)?
At this point in time, airflow does not support adding/removing a task while the dag is running.
The workflow order will be whatever is evaluated at the start of the dag run.
See the second paragraph here.
This means you cannot add/remove tasks based on something that happens in the run. You can add X tasks in a for loop based on something not related to the run, but after the run has begun there is no changing the workflow shape/order.
Many times you can instead use a BranchPythonOperator to make a decision during a dag run, (and these decisions can be based on your xcom values) but they must be a decision to go down a branch that already exists in the workflow.
Dag runs, and Dag definitions are separated in airflow in ways that aren't entirely intuitive, but more or less anything that is created/generated inside a dag run (xcom, dag_run.conf, etc.) is not usable for defining the dag itself.