What is the usage of DummyOperator in Airflow?

I'm new to Airflow and I know that DummyOperator does nothing.
So what is the scenario for using DummyOperator?
When would you typically use it?

A common use is to make the workflow graph easier to read. Consider an example with these dependencies:
task_1 >> task_3
task_2 >> task_3
task_1 >> task_4
task_2 >> task_4
Technically you want task_3 and task_4 to be executed only after both task_1 and task_2 are completed. But when you look at the graph, it is not super intuitive.
Solution? You can improve readability (not code readability, but the readability of the graph view and therefore of the workflow) by adding a task_dummy after task_1 and task_2 and running task_3 and task_4 after task_dummy. When a new user looks at the graph, they will immediately understand the workflow. The modified workflow is as follows.
task_1 >> task_dummy << task_2
task_dummy >> task_3
task_dummy >> task_4
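Put together, a minimal sketch of that pattern could look like the following (DummyOperator is used for every task here just to keep the example self-contained; the dag_id is hypothetical and the import path assumes Airflow 1.x):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='dummy_join_example',  # hypothetical dag_id
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

task_1 = DummyOperator(task_id='task_1', dag=dag)
task_2 = DummyOperator(task_id='task_2', dag=dag)
task_dummy = DummyOperator(task_id='task_dummy', dag=dag)  # pure join point, does nothing
task_3 = DummyOperator(task_id='task_3', dag=dag)
task_4 = DummyOperator(task_id='task_4', dag=dag)

task_1 >> task_dummy << task_2
task_dummy >> task_3
task_dummy >> task_4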

Related

Apache Airflow: Conditionals running before triggered

I'm having a problem with my DAG. I want to set it up so that if one task fails, another task runs and the entire run doesn't fail.
The code is proprietary so I can't post the code snippet. So sorry!
Task0 >> [Task1, Task2]
Task1 >> Task1a
If Task1 fails, I want task2 to execute. If task1 is successful, I want task1a to execute. My current code for task2 looks like this:
task2 = DummyOperator(
    task_id='task2',
    trigger_rule='one_failed',
    dag=dag,
)
I've been playing around with the trigger_rule but this keeps running before task1. It just runs right away.
Your operator is fine. The dependency is wrong.
The Task0 >> [Task1, Task2] dependency means that Task1 runs in parallel with Task2, and the trigger_rule='one_failed' of Task2 is checked against its direct upstream tasks. This means the rule is evaluated against the status of Task0, not against Task1.
To fix the issue you need to change the dependencies to:
Task0 >> Task1 >> Task2
Task1 >> Task1a
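Put together, the corrected wiring could look roughly like this (DummyOperators stand in for the real, proprietary tasks; the dag_id is hypothetical):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='one_failed_example',
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

task0 = DummyOperator(task_id='task0', dag=dag)
task1 = DummyOperator(task_id='task1', dag=dag)
task1a = DummyOperator(task_id='task1a', dag=dag)  # default trigger rule: runs only if task1 succeeds
task2 = DummyOperator(
    task_id='task2',
    trigger_rule='one_failed',  # now checked against task1, its direct upstream
    dag=dag,
)

task0 >> task1 >> task2
task1 >> task1a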

Airflow configuration recommendation for LocalExecutor

I am using the Airflow docker-compose from here and I have some performance issues, along with strange behavior where Airflow crashes.
First, I have 5 DAGs running at the same time; each of them has 8 steps with max_active_runs=1:
step1x
step2y
step3 >> step4 >> step8
step3 >> step5 >> step8
step3 >> step6 >> step8
step3 >> step7 >> step8
I would like to know what configuration I should use in order to maximize Airflow parallelism vs. stability, i.e. I want to know the maximum recommended value of each of the options below for a machine that has X CPUs and Y GB of RAM.
I am using a LocalExecutor but I can't figure out how I should configure the parallelism:
AIRFLOW__SCHEDULER__SCHEDULER_MAX_THREADS=?
AIRFLOW__CORE__PARALLELISM=?
AIRFLOW__WEBSERVER__WORKERS=?
Is there documentation that states the recommendation for each of those based on your machine's specifications?
I'm not sure you have a parallelism problem...yet.
Can you clarify something? Do you have 5 different DAGs with similar set-ups, or is this launching five instances of the same DAG at once? I'd expect the former because of the max_active_runs setting.
On your task declaration here:
step1x
step2y
step3 >> step4 >> step8
step3 >> step5 >> step8
step3 >> step6 >> step8
step3 >> step7 >> step8
Are you expecting step1x, step2y and step3 to all execute at the same time, then steps 4-7, and finally step8? What are you doing in the DAG that needs that kind of structure rather than running steps 1-8 sequentially?
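For what it's worth, these settings are environment-variable overrides of airflow.cfg options and can be set directly in the docker-compose file. A purely illustrative sketch, with values that are assumptions rather than documented recommendations:
AIRFLOW__CORE__PARALLELISM=8    # assumed value: upper bound on task instances running across all DAGs
AIRFLOW__WEBSERVER__WORKERS=2   # assumed value: gunicorn workers serving the UI, unrelated to task execution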

For Apache Airflow, How can I pass the parameters when manually trigger DAG via CLI?

I use Airflow to manage ETL task execution and scheduling. A DAG has been created and it works fine. But is it possible to pass parameters when manually triggering the DAG via the CLI?
For example:
My DAG runs every day at 01:30 and processes data for yesterday (the time range from 01:30 yesterday to 01:30 today). There might be some issues with the data source, and then I need to re-process that data (manually specifying the time range).
So can I create an Airflow DAG where, when it is scheduled, the default time range is from 01:30 yesterday to 01:30 today, and if anything goes wrong with the data source, I can manually trigger the DAG and pass the time range as parameters?
As far as I know, airflow test has a -tp option that can pass params to a task, but that is only for testing a specific task, and airflow trigger_dag doesn't have a -tp option. So is there any way to trigger_dag and pass parameters to the DAG so that the operators can read them?
Thanks!
You can pass parameters from the CLI using --conf '{"key":"value"}' and then use it in the DAG file as "{{ dag_run.conf["key"] }}" in a templated field.
CLI:
airflow trigger_dag 'example_dag_conf' -r 'run_id' --conf '{"message":"value"}'
DAG File:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
)

def run_this_func(ds, **kwargs):
    # Reads the value passed via --conf from the DagRun object
    print("Remotely received value of {} for key=message".format(
        kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag,
)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}"',
    dag=dag,
)
This should work, as per the airflow documentation: https://airflow.apache.org/cli.html#trigger_dag
airflow trigger_dag -c '{"key1":1, "key2":2}' dag_id
Make sure the value of -c is a valid json string, so the double quotes wrapping the keys are necessary here.
Suppose the value passed in --conf for a key is a list of strings:
key: ['param1=somevalue1', 'param2=somevalue2']
First way, in a templated field:
"{{ dag_run.conf["key"] }}"
This will render the passed value as the string "['param1=somevalue1', 'param2=somevalue2']".
Second way, in a Python callable:
def get_parameters(self, **kwargs):
    dag_run = kwargs.get('dag_run')
    parameters = dag_run.conf['key']
    return parameters
In this scenario, the list of strings is passed through and will be available as an actual Python list: ['param1=somevalue1', 'param2=somevalue2']
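For completeness, passing such a list from the CLI would look something like this (the dag_id my_dag is a placeholder):
airflow trigger_dag my_dag --conf '{"key": ["param1=somevalue1", "param2=somevalue2"]}'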

Rerun Airflow Dag From Middle Task and continue to run it till end of all downstream tasks.(Resume Airflow DAG from any task)

Hi, I am new to Apache Airflow. I have a DAG of dependencies, let's say:
Task A >> Task B >> Task C >> Task D >> Task E
1. Is it possible to run the Airflow DAG from a middle task, let's say Task C?
2. Is it possible to run only a specific branch in the case of a branching operator in the middle?
3. Is it possible to resume the Airflow DAG from the last failed task?
4. If that's not possible, how do you manage large DAGs and avoid rerunning redundant tasks?
Please give me suggestions on how to implement this if possible.
1. You can't do it manually. If you use a BranchPythonOperator, you can skip tasks up to the task you wish to start with, according to the conditions set in the BranchPythonOperator.
2. Same as 1.
3. Yes. You can clear a task together with its upstream tasks up to the root, or its downstream tasks down to all leaves of the node.
You can do something like:
Task A >> Task B >> Task C >> Task D
Task C >> Task E
Where C is the branch operator.
For example:
from datetime import date

from airflow.operators.python_operator import BranchPythonOperator

def branch_func():
    # Run Task D on Mondays, Task E on every other day
    if date.today().weekday() == 0:
        return 'task id of D'
    else:
        return 'task id of E'

Task_C = BranchPythonOperator(
    task_id='branch_operation',
    python_callable=branch_func,
    dag=dag)
This will be the task sequence on Monday:
Task A >> Task B >> Task C >> Task D
This will be the task sequence for the rest of the week:
Task A >> Task B >> Task C >> Task E
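Regarding point 3 (resuming from the last failed task), clearing can also be done from the CLI. A rough sketch, assuming the Airflow 1.x CLI and a hypothetical dag_id my_dag: this clears Task_C and everything downstream of it for one execution date, after which the scheduler re-runs the cleared tasks.
airflow clear my_dag -t '^Task_C$' -s 2019-01-01 -e 2019-01-01 --downstream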

How to exit with error from script to Airflow?

Say I'm running:
t = BashOperator(
    task_id='import',
    bash_command="""python3 script.py '{{ next_execution_date }}' """,
    dag=dag)
And for some reason I want the script to exit with an error and indicate to Airflow that it should retry this task.
I tried to use os._exit(1) but Airflow marks the task as a success.
I know there is:
from airflow.exceptions import AirflowException
raise AirflowException("error msg")
But this is more for functions written inside a DAG. My script is independent, and sometimes we run it on its own, regardless of Airflow.
Also, the script is Python 3 while Airflow is running under Python 2.7.
It seems excessive to install Airflow on Python 3 just for error handling.
Is there any other solution?
Add || exit 1 at the end of your Bash command:
bash_command="""python3 script.py '{{ next_execution_date }}' || exit 1 """
More information: https://unix.stackexchange.com/a/309344
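Alternatively, if the script can be edited, exiting with any nonzero status has the same effect, because the BashOperator fails the task when the command's exit code is nonzero. A minimal sketch of how script.py could signal failure (run_import is a hypothetical stand-in for the script's real work):
import sys

try:
    run_import()  # hypothetical: the actual work done by script.py
except Exception as err:
    print(err)
    sys.exit(1)   # nonzero exit code fails the BashOperator task, so Airflow can retry it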
