Suppose I have Airflow tasks like this:
apple_task = DummyOperator(
    task_id='apple'
)
banana_task = DummyOperator(
    task_id='banana'
)
cherry_task = DummyOperator(
    task_id='cherry'
)
apple_task >> cherry_task
banana_task >> cherry_task
Do the repeated applications of >> stack or replace the previous one?
What will the graph look like?
Airflow 2.2.2
They stack: apple_task and banana_task will be run in parallel, and both must succeed before cherry_task runs.
It's equivalent to [apple_task, banana_task] >> cherry_task.
The scheduler parses the DAG files (every 30 seconds by default); each DAG is read and its graph constructed. An advantage of specifying task dependencies the way you did is that you can dynamically create tasks at parse time, since they are just Python objects.
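For example, a minimal sketch (the DAG name, schedule, and dates below are made up for illustration) that builds the two upstream tasks in a loop at parse time and fans them into cherry_task:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG('fan_in_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # created dynamically at parse time - they are just Python objects
    fruit_tasks = [DummyOperator(task_id=name) for name in ('apple', 'banana')]
    cherry_task = DummyOperator(task_id='cherry')

    # stacks the dependencies, equivalent to two separate `>> cherry_task` statements
    fruit_tasks >> cherry_task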
The DAG documentation page has some more examples under the task dependencies heading here and the control flow heading here.
By transactional-like I mean: either all tasks in a set are successful, or none of them are and the set should be retried from the first one.
Consider two operators, A and B, A downloads a file, B reads it and performs some actions.
A successfully executes, but before B comes into play a blackout/corruption occurs, and B cannot process the file, so it fails.
To pass, it needs A to re-download the file, but since A is in success status, there is no direct way to do that automatically; a human must go and clear A's status.
Now I wonder: is there a known way to clear the statuses of task instances up to a certain task when some task fails?
I know I can use a hook, as in "clear an upstream task in airflow within the dag", but that looks a bit ugly.
This feature is not supported by airflow, but you can use on_failure_callback to achieve that:
def clear_task_A(context):
    execution_date = context["execution_date"]
    clear_task = BashOperator(
        task_id='clear_task_A',
        # -s: start date, -t: task_id regex, -d: include downstream tasks, -y: skip confirmation prompt
        bash_command=f'airflow tasks clear -s {execution_date} -t A -d -y <dag_id>'
    )
    return clear_task.execute(context=context)
with DAG(...) as dag:
    A = DummyOperator(
        task_id='A'
    )
    B = BashOperator(
        task_id='B',
        bash_command='exit 1',
        on_failure_callback=clear_task_A
    )
    A >> B
Airflow graph task branch never runs, complains “Task instance did not exist in the DB”, but the task is visible in the graph.
I have an Airflow graph with a conditional branch defined like this:
class BranchFlags(Enum):
    yes = "yes"
    no = "no"
...
for table in list_of_tables:  # type: list(dict)
    task_1 = BashOperator(
        task_id='task_1_%s' % table["conf1"],
        bash_command='bash script1.sh %s' % table["conf1"],
        dag=dag)
    if table["branch_flag"] == BranchFlags.yes:
        consolidate = BashOperator(
            task_id='task_3_%s' % table["conf2"],
            bash_command='python %s/consolidate_parquet.py %s' % table["conf2"],
            dag=dag)
    task_3 = BashOperator(
        task_id='task_3_%s' % table["conf3"],
        bash_command='bash script3.sh %s' % table["conf3"],
        dag=dag)
    task_1 >> task_3
    if table["branch_flag"] == BranchFlags.yes:
        task_1 >> consolidate
and here is the graph in the airflow UI from my actual code:
Notice that even though the longer parts of the graph run fine, the lone branch is not being run for the one sequence that was supposed to branch. When viewing the logs for the task, I see
*** Task instance did not exist in the DB
This is weird to me, since ostensibly the scheduler DB sees the task: it does appear in the web UI graph. I'm not sure what is going on here; other changes to the DAG .py file do show up in the graph and are executed by the scheduler when the graph runs. Attempting to view the task's Task Instance Details throws the error
Task [dagname.task_3_qwerty] doesn't seem to exist at the moment
Running airflow resetdb (as I've seen suggested in other posts) does nothing to fix the problem.
Note that the intention is that the short branch runs concurrently with the longer branch (not as an either or choice).
Anyone know why this would be happening or have some debugging tips?
I have a main DAG which retrieves a file and splits the data in this file into separate CSV files.
I have another set of tasks that must be done for each of these CSV files, e.g. uploading to GCS and inserting into BigQuery.
How can I generate a SubDAG for each file dynamically, based on the number of files? The SubDAG will define the tasks (uploading to GCS, inserting into BigQuery, deleting the CSV file).
So right now, this is what it looks like
main_dag = DAG(....)
download_operator = SFTPOperator(dag = main_dag, ...) # downloads file
transform_operator = PythonOperator(dag = main_dag, ...) # Splits data and writes csv files
def subdag_factory(): # Will return a subdag with tasks for uploading to GCS, inserting to BigQuery.
...
...
How can I call the subdag_factory for each file generated in transform_operator?
I tried creating subdags dynamically as follows
# create and return a DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
    # dag params
    dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
    default_args_copy = default_args.copy()

    # dag
    dag = DAG(dag_id=dag_id_child,
              default_args=default_args_copy,
              schedule_interval='@once')

    # operators
    tid_check = 'check2_db_' + db_name
    py_op_check = PythonOperator(task_id=tid_check, dag=dag,
                                 python_callable=check_sync_enabled,
                                 op_args=[db_name])

    tid_spark = 'spark2_submit_' + db_name
    py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
                                 python_callable=spark_submit,
                                 op_args=[db_name])

    py_op_check >> py_op_spark
    return dag
# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
    tid_subdag = 'subdag_' + db_name
    subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
    sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
    return sd_op

# create a SubDagOperator for each db in db_names
def create_subdag_operators(dag_parent, db_names):
    subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
    # chain subdag-operators together
    airflow.utils.helpers.chain(*subdags)
    return subdags
# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
default_args=default_args,
schedule_interval=None)
subdag_ops = create_subdag_operators(dag, db_names)
Note that the list of inputs for which SubDAGs are created (here db_names) can either be declared statically in the Python file or be read from an external source.
The resulting DAG looks like this
Diving into SubDAG(s)
Airflow deals with dynamic DAGs in two different ways.
One way is to define your dynamic DAG in one Python file and put it into the dags_folder. It generates the DAGs based on an external source (config files in another directory, SQL, NoSQL, etc.). The fewer changes to the structure of the DAG, the better (this is actually true in all situations). For instance, our DAG file generates a DAG for every record (or file), and it generates the dag_id as well. On every heartbeat of the Airflow scheduler, this code goes through the list and generates the corresponding DAGs. Pros: not many, just one code file to change. Cons: plenty, and they come down to the way Airflow works. For every new DAG (dag_id), Airflow writes its steps into the database, so when the number of steps changes, or a step's name changes, it might break the web server. When you delete a DAG from your list, it becomes a kind of orphan: you can't access it from the web interface and have no control over it; you can't see the steps, you can't restart them, and so on. If you have a static list of DAGs whose IDs are not going to change but whose steps occasionally do, this method is acceptable.
At some point I came up with another solution: static DAGs (they are still dynamic in the sense that a script generates them, but their structure and IDs do not change). Instead of one script that walks through a list (e.g. a directory) and generates DAGs, you create two static DAGs: one monitors the directory periodically (every 10 minutes, */10 * * * *), and the other is triggered by the first. So when a new file (or files) appears, the first DAG triggers the second one with a conf argument. The following code has to be executed for every file in the directory.
from datetime import datetime
import logging

from airflow import settings
from airflow.models import DagRun

session = settings.Session()
dr = DagRun(
    dag_id=dag_to_be_triggered,
    run_id=uuid_run_id,
    conf={'file_path': path_to_the_file},
    execution_date=datetime.now(),
    start_date=datetime.now(),
    external_trigger=True)
logging.info("Creating DagRun {}".format(dr))
session.add(dr)
session.commit()
session.close()
The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param use this:
def work_with_the_file(**context):
    path_to_file = context['dag_run'].conf['file_path'] \
        if 'file_path' in context['dag_run'].conf else None

    if not path_to_file:
        raise Exception('path_to_file must be provided')
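For completeness, a rough sketch of how this callable might be wired into the triggered DAG (the dag_id and task_id here are illustrative, and provide_context is only needed on Airflow 1.x):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

triggered_dag = DAG('dag_to_be_triggered',
                    schedule_interval=None,  # only runs when triggered externally
                    start_date=datetime(2021, 1, 1))

process_file = PythonOperator(
    task_id='work_with_the_file',
    python_callable=work_with_the_file,
    provide_context=True,
    dag=triggered_dag)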
Pros: all the flexibility and functionality of Airflow.
Cons: the monitor DAG can be spammy.
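As a side note, on Airflow 2.x the same trigger step can also be expressed with the built-in TriggerDagRunOperator instead of creating DagRun rows by hand; a rough sketch, where monitor_dag and path_to_the_file are placeholders from the description above:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# in the monitoring DAG: trigger the processing DAG, passing the file path as conf
trigger = TriggerDagRunOperator(
    task_id='trigger_file_processing',
    trigger_dag_id='dag_to_be_triggered',
    conf={'file_path': path_to_the_file},  # readable downstream via dag_run.conf
    dag=monitor_dag)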
My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc), and a task is launched for every element in the input list. The goal is to make use of Airflow's retrying and other logic, instead of reimplementing it.
So, ideally, my DAG should look something like this:
The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate.
This is my code:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1)
}
dag = DAG('dynamic_dag_generator', schedule_interval=None, default_args=default_args)
foo_operator = BashOperator(
    task_id='foo',
    bash_command="echo '%s'" % json.dumps(list(range(0, random.randint(40, 60)))),
    xcom_push=True,
    dag=dag)
def gen_nodes(**kwargs):
    ti = kwargs['ti']
    workers = json.loads(ti.xcom_pull(task_ids='foo'))

    for wid in workers:
        print("Iterating worker %s" % wid)
        op = PythonOperator(
            task_id='test_op_%s' % wid,
            python_callable=lambda: print("Dynamic task!"),
            dag=dag
        )
        op.set_downstream(bar_operator)
        op.set_upstream(dummy_op)
gen_subdag_node_op = PythonOperator(
    task_id='gen_subdag_nodes',
    python_callable=gen_nodes,
    provide_context=True,
    dag=dag
)
gen_subdag_node_op.set_upstream(foo_operator)

dummy_op = DummyOperator(
    task_id='dummy',
    dag=dag
)
dummy_op.set_upstream(gen_subdag_node_op)

bar_operator = DummyOperator(
    task_id='bar',
    dag=dag)
bar_operator.set_upstream(dummy_op)
In the logs, I can see that gen_nodes is executed correctly (i.e. Iterating worker 5, etc). However, the new tasks are not scheduled and there is no evidence that they were executed.
I found related code samples online, such as this, but could not make it work. Am I missing something?
Alternatively, is there a more appropriate approach to this problem (isolating units of work)?
At this point in time, airflow does not support adding/removing a task while the dag is running.
The workflow order will be whatever is evaluated at the start of the dag run.
See the second paragraph here.
This means you cannot add/remove tasks based on something that happens in the run. You can add X tasks in a for loop based on something not related to the run, but after the run has begun there is no changing the workflow shape/order.
Often you can instead use a BranchPythonOperator to make a decision during a DAG run (and those decisions can be based on your XCom values), but it must be a decision to go down a branch that already exists in the workflow.
DAG runs and DAG definitions are separated in Airflow in ways that aren't entirely intuitive, but more or less anything that is created/generated inside a DAG run (XCom, dag_run.conf, etc.) is not usable for defining the DAG itself.
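For illustration, a minimal sketch of that pattern (task and branch names are made up): the branching callable reads an XCom value pushed earlier in the run and chooses between two branches that already exist in the DAG.

from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

def choose_branch(**kwargs):
    # the decision uses an XCom value from this run, but both branches
    # were already defined when the DAG file was parsed
    count = int(kwargs['ti'].xcom_pull(task_ids='foo'))
    return 'many_items_path' if count > 50 else 'few_items_path'

branch = BranchPythonOperator(
    task_id='branch',
    python_callable=choose_branch,
    provide_context=True,  # not required on Airflow >= 2.0
    dag=dag)

many_items_path = DummyOperator(task_id='many_items_path', dag=dag)
few_items_path = DummyOperator(task_id='few_items_path', dag=dag)

branch >> [many_items_path, few_items_path]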
I need the status of a task (whether it is running, up_for_retry, or failed) within the same DAG. So I tried to get it using the code below, though I got no output...
Auto = PythonOperator(
    task_id='test_sleep',
    python_callable=execute_on_emr,
    op_kwargs={'cmd': 'python /home/hadoop/test/testsleep.py'},
    dag=dag)
logger.info(Auto)
The intention is to kill certain running tasks once a particular task on airflow completes.
The question is: how do I get the state of a task, i.e. whether it is running, failed, or successful?
I am doing something similar. I need to check for one task if the previous 10 runs of another task were successful.
taky2 sent me on the right path. It is actually fairly easy:
from airflow.models import TaskInstance
ti = TaskInstance(*your_task*, execution_date)
state = ti.current_state()
As I want to check that within the DAG, it is not necessary to specify the DAG.
I simply created a function to loop through the past n_days and check the status.
def check_status(**kwargs):
    last_n_days = 10
    for n in range(0, last_n_days):
        date = kwargs['execution_date'] - timedelta(n)
        # my_task is the task object defined within the DAG, not the task_id
        # (in the example below: check_success_task rather than 'check_success_days_before')
        ti = TaskInstance(*my_task*, date)
        state = ti.current_state()
        if state != 'success':
            raise ValueError('Not all previous tasks successfully completed.')
When you call the function make sure to set provide_context.
check_success_task = PythonOperator(
    task_id='check_success_days_before',
    python_callable=check_status,
    provide_context=True,
    dag=dag
)
UPDATE:
When you want to check a task from another DAG, you need to fetch it like this:
from airflow import configuration as conf
from airflow.models import DagBag, TaskInstance
dag_folder = conf.get('core','DAGS_FOLDER')
dagbag = DagBag(dag_folder)
check_dag = dagbag.dags[*my_dag_id*]
my_task = check_dag.get_task(*my_task_id*)
ti = TaskInstance(my_task, date)
Apparently there is by now also an API call that does the same thing:
from airflow.api.common.experimental.get_task_instance import get_task_instance
ti = get_task_instance(*my_dag_id*, *my_task_id*, date)
Take a look at the code responsible for the command line interface operation suggested by Priyank.
https://github.com/apache/incubator-airflow/blob/2318cea74d4f71fba353eaca9bb3c4fd3cdb06c0/airflow/bin/cli.py#L581
def task_state(args):
    dag = get_dag(args)
    task = dag.get_task(task_id=args.task_id)
    ti = TaskInstance(task, args.execution_date)
    print(ti.current_state())
Hence, it seems you should easily be able to accomplish this within your DAG codebase using similar code.
Alternatively, you could execute these CLI operations from within your code using Python's subprocess library.
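For example, a rough sketch of the subprocess approach (the dag_id, task_id, and date are placeholders, and the CLI command name matches the pre-2.0 syntax shown further below):

import subprocess

# ask the Airflow CLI for the state of one task instance; the state
# (e.g. 'success') appears in the command's output
result = subprocess.run(
    ['airflow', 'task_state', 'my_dag_id', 'my_task_id', '2018-01-01'],
    capture_output=True, text=True, check=True)
print(result.stdout)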
Okay, I think I know what you're doing and I don't really agree with it, but I'll start with an answer.
A straightforward, but hackish, way would be to query the task_instance table. I'm on Postgres, but the structure should be the same. Start by grabbing the task_id and state of the task you're interested in with a DB call.
SELECT task_id, state
FROM task_instance
WHERE dag_id = '<dag_id_attrib>'
AND execution_date = '<execution_date_attrib>'
AND task_id = '<task_to_check>'
That should give you the state (and name, for reference) of the task you're trying to monitor. State is stored as a simple lowercase string.
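If you would rather stay in Python, roughly the same lookup can go through Airflow's own SQLAlchemy session (a sketch; the dag_id/task_id values are placeholders and execution_date is assumed to be defined):

from airflow import settings
from airflow.models import TaskInstance

session = settings.Session()
ti = (session.query(TaskInstance)
      .filter(TaskInstance.dag_id == '<dag_id_attrib>',
              TaskInstance.task_id == '<task_to_check>',
              TaskInstance.execution_date == execution_date)
      .first())
print(ti.state if ti else None)  # state is a simple lowercase string
session.close()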
You can use the command line interface for this:
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
For more on this you can refer to the official Airflow documentation:
http://airflow.incubator.apache.org/cli.html