Currently I have a list of tasks that all need to run at the same time every day; however, they are all independent of one another. I know I can set them to run in a certain order, e.g. t1 >> t2 >> t3, but I would like the order to be random so the order they finish in isn't always the same. How can I run a list of Airflow tasks in a random order?
You've just said that they are independent of each other, so why don't you just run them all at the same time?
This can be achieved by simply not using any shift operators, e.g.:
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
args = {
    'owner': 'Airflow',
    'start_date': days_ago(0)
}
dag = DAG(dag_id='example_random_task', default_args=args, max_active_runs=0, catchup=False)
first_operator = DummyOperator(task_id='{}_operator'.format("first"), dag=dag)
second_operator = DummyOperator(task_id='{}_operator'.format("second"), dag=dag)
third_operator = DummyOperator(task_id='{}_operator'.format("third"), dag=dag)
But if you really want a random order of tasks and want them executed as a kind of random queue, you can add all your tasks to a list and then just shuffle it. Then iterate over the tasks and make each one depend on the next, for example:
To do so, use random.shuffle(), which shuffles the list in place:
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago
import random
args = {
    'owner': 'Airflow',
    'start_date': days_ago(0)
}
dag = DAG(dag_id='example_random_task', default_args=args, max_active_runs=0, catchup=False)
first_operator = DummyOperator(task_id='{}_operator'.format("first"), dag=dag)
second_operator = DummyOperator(task_id='{}_operator'.format("second"), dag=dag)
third_operator = DummyOperator(task_id='{}_operator'.format("third"), dag=dag)
tasks_list = [first_operator, second_operator, third_operator]
random.shuffle(tasks_list)
i = 0
while i < len(tasks_list) - 1:
    tasks_list[i] << tasks_list[i + 1]
    i += 1
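Alternatively, instead of the while loop, you can wire the shuffled list in one call with the chain helper (a sketch, assuming airflow.models.baseoperator.chain from Airflow 2; in 1.10 it lives in airflow.utils.helpers):
from airflow.models.baseoperator import chain

# chain(a, b, c) wires a >> b >> c, so the shuffled list becomes one random linear order.
chain(*tasks_list)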
Have fun!
I am trying to implement a dependency between 2 DAGs, say A and B. DAG A runs once every hour and DAG B runs every 15 mins.
Each time DAG B starts its run, I want to make sure DAG A is not in a running state.
If DAG A is found to be running, then DAG B has to wait until DAG A completes its run.
If DAG A is not running, DAG B can proceed with its tasks.
DAG A :
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_A', schedule_interval='0/60 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:
    task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
    task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
    task3 = DummyOperator(task_id='task3', retries=1, dag=dag)

    task1 >> task2 >> task3
DAG B:
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_B', schedule_interval='0/15 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:
    task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
    task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
    task6 = DummyOperator(task_id='task6', retries=1, dag=dag)

    task4 >> task5 >> task6
I have tried using the ExternalTaskSensor operator, but I am unable to tell whether the sensor triggers the next task when it finds DAG A in a success state, or otherwise waits for the run to complete.
Thanks in advance.
I think the only way you can achieve that in a "general" way is to use some external locking mechanism.
You can achieve quite a good approximation, though, using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
If you set the pool size to 1 and assign the tasks of both DAG A and DAG B to that pool, only one of them can be running at a time. You can also add priority_weight in whatever way you see best, in case you need to prioritise A over B or the other way round.
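A minimal sketch of that idea (assuming a pool named dag_a_b_lock has already been created with 1 slot under Admin -> Pools in the UI; the schedules and priority values are arbitrary placeholders):
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

# Both DAGs put their tasks into the same 1-slot pool, so the scheduler will
# only run a task from one of them at any given time.
with DAG('DAG_A', start_date=datetime(2020, 9, 10), schedule_interval='@hourly', catchup=False) as dag_a:
    task1 = DummyOperator(task_id='task1', pool='dag_a_b_lock', priority_weight=10)

with DAG('DAG_B', start_date=datetime(2020, 9, 10), schedule_interval='*/15 * * * *', catchup=False) as dag_b:
    task4 = DummyOperator(task_id='task4', pool='dag_a_b_lock', priority_weight=1)
The higher priority_weight on DAG A's task means that, when tasks from both DAGs are queued for the single slot, A's task is picked first.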
You could use ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, which in your example is the execution_date of the last DagRun of DAG_A.
Check this example, where DAG_A runs every 9 minutes for 200 seconds and DAG_B runs every 3 minutes for 30 seconds. These values are arbitrary and only for demo purposes; they could be pretty much anything.
DAG A (nothing new here):
import time
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def _executing_task(**kwargs):
    print("Starting task_a")
    time.sleep(200)
    print("Completed task_a")
dag = DAG(
    dag_id="example_external_task_sensor_a",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/9 * * * *",
    tags=['example_dags'],
    catchup=False
)
with dag:
    start = DummyOperator(
        task_id='start')

    task_a = PythonOperator(
        task_id='task_a',
        python_callable=_executing_task,
    )

    chain(start, task_a)
DAG B:
import time
from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor
def _executing_task():
    time.sleep(30)
    print("Completed task_b")

@provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
    dag_a_last_run = get_last_dagrun(
        'example_external_task_sensor_a', session)
    print(dag_a_last_run)
    print(f"EXEC DATE: {dag_a_last_run.execution_date}")
    return dag_a_last_run.execution_date
dag = DAG(
    dag_id="example_external_task_sensor_b",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/3 * * * *",
    tags=['example_dags'],
    catchup=False
)
with dag:
    start = DummyOperator(
        task_id='start')

    wait_for_dag_a = ExternalTaskSensor(
        task_id='wait_for_dag_a',
        external_dag_id='example_external_task_sensor_a',
        allowed_states=['success', 'failed'],
        execution_date_fn=_get_execution_date_of_dag_a,
        poke_interval=30
    )

    task_b = PythonOperator(
        task_id='task_b',
        python_callable=_executing_task,
    )

    chain(start, wait_for_dag_a, task_b)
We are using the execution_date_fn param of the ExternalTaskSensor in order to obtain the execution_date of the last DagRun of DAG_A. If we don't do so, it will wait for a DAG_A run with the same execution_date as the actual run of DAG_B, which may not exist in many cases.
The function _get_execution_date_of_dag_a queries the metadata DB to obtain that execution_date, using get_last_dagrun from the Airflow models.
Finally, the other important parameter is allowed_states=['success', 'failed'], where we are telling it to wait until DAG_A is found in one of those states (i.e. if DAG_A is in a running state, the sensor will keep poking).
Try it out and let me know if it worked for you!
I am new to Airflow and not sure how to create my first DAG. I used PyCharm to create my first DAG file, DAG_try.py, like below:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'myname',
    'email': 'myemail',
    'start_date': datetime(2021, 4, 29)
}
dag = DAG(dag_id='DAG-try', default_args=default_args, schedule_interval='@once')

Fidelity = BashOperator(task_id='Fidelity_data',
                        bash_command='python Fidelity.py',
                        dag=dag)

Gemini = BashOperator(task_id='Gemini_data',
                      bash_command='python Gemini.py',
                      dag=dag)

Get_results = BashOperator(task_id='get_results',
                           bash_command='python get_results.py',
                           dag=dag)

[Fidelity, Gemini] >> Get_results
I put it under the directory C:\Users\name\AppData\Local\Packages\CanonicalGroupLimited.UbuntuonWindows_79rhkp1fndgsc\LocalState\rootfs\home\username\airflow\dags\
But my DAG is not shown in the Airflow UI or in the dags list.
Under the ...\airflow\airflow.cfg file, dags_folder = ~/airflow/dags
I am not sure whether the directory is set correctly or not. Could anyone help me with this?
Thanks a lot in advance!
I am trying to generate Airflow DAGs using a template in Python code, using globals() as described here to define the DAG object and save it. Below is my code:
import datetime as dt
import sys
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
argumentList = sys.argv
owner = argumentList[1]
dag_name = argumentList[2]
taskID = argumentList[3]
bashCommand = argumentList[4]
default_args = {
    'owner': owner,
    'start_date': dt.datetime(2019, 6, 1),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}
def dagCreate():
    with DAG(dag_name,
             default_args=default_args,
             schedule_interval=None,
             ) as dag:
        print_hello = BashOperator(task_id=taskID, bash_command=bashCommand)
    return dag

globals()[dag_name] = dagCreate()
I have kept this Python code outside the dags folder, and I am executing it as follows:
python bash-dag-generator.py Airflow test_bash_generate auto_bash_task ls
But I don't see any DAG generated in the Airflow webserver UI. I am not sure where I am going wrong.
As per the official documentation:
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
So unless your code is actually inside the DAG_FOLDER, it will not be registered as a DAG.
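One way to adapt your script to that constraint (a sketch only: it assumes you move the module into the dags folder and read its parameters from a hypothetical dag_configs.json file placed next to it, instead of sys.argv, since the scheduler imports DAG files rather than running them with arguments):
import datetime as dt
import json
import os

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical config file living next to this module inside the dags folder.
config_path = os.path.join(os.path.dirname(__file__), 'dag_configs.json')

default_args = {
    'owner': 'Airflow',
    'start_date': dt.datetime(2019, 6, 1),
}

with open(config_path) as f:
    # e.g. [{"dag_name": "test_bash_generate", "task_id": "auto_bash_task", "bash_command": "ls"}]
    configs = json.load(f)

for cfg in configs:
    dag = DAG(cfg['dag_name'], default_args=default_args, schedule_interval=None)
    BashOperator(task_id=cfg['task_id'], bash_command=cfg['bash_command'], dag=dag)
    # Registering each DAG in the module's global namespace is what lets the DagBag find it.
    globals()[cfg['dag_name']] = dag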
The way I have been able to implement dynamic DAGs is by using an Airflow Variable.
In the below example I have a CSV file that holds a list of Bash commands like ls, echo etc. As part of the read_file_task I update the file location in the Airflow Variable. The part where we read the CSV file and loop through the commands is where the dynamic tasks get created.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import csv
'''
Orchestrate the Dynamic Tasks
'''
def read_file_task():
    print('I am reading a File and setting variables ')
    Variable.set('dynamic-dag-sample', '/home/bashoperator.csv')

with DAG('dynamic-dag-sample',
         start_date=datetime(2018, 11, 1)) as dag:

    read_file_task = PythonOperator(task_id='read_file_task',
                                    python_callable=read_file_task, provide_context=True,
                                    dag=dag)

    # Use a default so the first parse (before the Variable exists) does not fail.
    dynamic_dag_sample_file_path = Variable.get("dynamic-dag-sample", default_var=None)
    if dynamic_dag_sample_file_path is not None:
        with open(dynamic_dag_sample_file_path) as csv_file:
            reader = csv.DictReader(csv_file)
            for row in reader:
                bash_task = BashOperator(task_id=row['Taskname'], bash_command=row['Command'])
                read_file_task.set_downstream(bash_task)
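For reference, the CSV file referenced above (/home/bashoperator.csv) would look something like this; only the Taskname and Command column names are assumed by the code, and the rows are just an illustration:
Taskname,Command
list_home,ls /home
say_hello,echo "hello"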
Is it possible to run two DAGs at different times with the ExternalTaskSensor?
I have two DAGs.
DAG A runs every two hours:
10 a.m. (successful)
12 p.m. (failed)
2 p.m. (successful)
DAG B depends on DAG A. DAG B waits for DAG A at 12 p.m. and fails, because DAG A failed. But since DAG A was successful at 2 p.m., DAG B should then be able to run.
How can you implement this? With an ExternalTaskSensor?
I just have a small dummy example, to try to understand it.
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import airflow

source_dag = DAG(
    dag_id='sensor_dag_source',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

first_task = DummyOperator(task_id='first_task', dag=source_dag)

target_dag = DAG(
    dag_id='sensor_dag_target',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

task_sensor = ExternalTaskSensor(
    dag=target_dag,
    task_id='dag_sensor_source_sensor',
    retries=100,
    retry_delay=timedelta(seconds=30),
    mode='reschedule',
    external_dag_id='sensor_dag_source',
    external_task_id='first_task'
)

first_task = DummyOperator(task_id='first_task', dag=target_dag)

task_sensor >> first_task
You can try using TriggerDagRunOperator to trigger DAG B from DAG A.
Here is a full answer:
In airflow, is there a good way to call another dag's task?
There is another good post about it:
Wiring top-level DAGs together
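A minimal sketch of that approach (using the Airflow 1.10 import path, and the dag_ids/task_ids from the dummy example above; the trigger is simply added as the last task of the source DAG):
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime

source_dag = DAG(
    dag_id='sensor_dag_source',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

first_task = DummyOperator(task_id='first_task', dag=source_dag)

# Runs only after first_task succeeds (default trigger rule), so the target
# DAG is triggered only after a successful source run.
trigger_target = TriggerDagRunOperator(
    task_id='trigger_target',
    trigger_dag_id='sensor_dag_target',
    dag=source_dag
)

first_task >> trigger_target
For a purely trigger-driven setup, the target DAG would typically be defined with schedule_interval=None so it only runs when triggered.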
I mostly see Airflow being used for ETL/Big data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic DAG ids. So, I created a simple Python script that takes a DAG id and task id as command line parameters. However, I'm running into problems making it work. It gives a "dag_id not found" error. Has anyone tried this? Here's the code for the script (call it tmp.py), which I execute on the command line as python tmp.py 820 2016-08-24T22:50:00:
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
execution = '2016-08-24T22:20:00'
if len(sys.argv) > 2:
    dagid = sys.argv[1]
    taskid = 'Activate' + sys.argv[1]
    execution = sys.argv[2]
else:
    dagid = 'DAGObjectId'
    taskid = 'Activate'

default_args = {'owner': 'airflow', 'depends_on_past': False, 'start_date': date.today(), 'email': ['fake@fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}

dag = DAG(dag_id=dagid,
          default_args=default_args,
          schedule_interval='@once',
          )
globals()[dagid] = dag
task1 = BashOperator(
    task_id=taskid,
    bash_command='ls -l',
    dag=dag)

fakeTask = BashOperator(
    task_id='fakeTask',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
task1.set_upstream(fakeTask)
airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After numerous trials and errors, I was able to figure this out. Hopefully it will help someone. Here's how it works: you need an iterator or an external source (file/database table) to generate DAGs/tasks dynamically through a template. You can keep the DAG and task names static and just assign them ids dynamically in order to differentiate one DAG from another. You put this Python script in the dags folder. When you start the Airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a DAG (unique dag id) has already been written, it will simply skip it. The scheduler also looks at the schedule of individual DAGs to determine which one is ready for execution. If a DAG is ready for execution, it executes it and updates its status.
Here's a sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time
dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))
input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'
def my_sleeping_function(random_base):
    '''This is a function that will run within the DAG execution'''
    time.sleep(random_base)

def_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now(), 'email_on_failure': False,
    'retries': 1, 'retry_delay': timedelta(minutes=2)
}
with open(input_file, 'r') as f:
    for line in f:
        args = line.strip().split(',')
        if len(args) < 7:
            continue
        dagid = 'DAA' + args[0]
        taskid = 'TAA' + args[0]
        yyyy = int(args[1])
        mm = int(args[2])
        dd = int(args[3])
        hh = int(args[4])
        mins = int(args[5])
        ss = int(args[6])
        dag = DAG(
            dag_id=dagid, default_args=def_args,
            schedule_interval='@once', start_date=datetime(yyyy, mm, dd, hh, mins, ss)
        )
        # Register each DAG in the global namespace so the DagBag picks it up (see the FAQ quote below).
        globals()[dagid] = dag
        myBashTask = BashOperator(
            task_id=taskid,
            bash_command='python /home/directory/airflow/sendemail.py',
            dag=dag)
        task2id = taskid + '-X'
        task_sleep = PythonOperator(
            task_id=task2id,
            python_callable=my_sleeping_function,
            op_kwargs={'random_base': 10},
            dag=dag)
        task_sleep.set_upstream(myBashTask)
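For reference, each line of the input text file is expected to hold a comma-separated id followed by a start timestamp (year, month, day, hour, minute, second); the rows below are just an illustration of that format inferred from the parsing code above:
001,2016,8,24,22,50,0
002,2016,8,25,10,0,0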
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object!
Copying my answer from this question. Only for v2.3 and above:
This is achieved using Dynamic Task Mapping, available only in Airflow versions 2.3 and higher.
More documentation and examples here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
from airflow import DAG
from airflow.decorators import task
from datetime import datetime

@task
def make_list():
    # This can also be from an API call, checking a database -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]

@task
def consumer(arg):
    print(list(arg))

with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
Example 2:
from airflow import XComArg
task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated accordingly.
Relevant issues here:
https://github.com/apache/airflow/projects/12