What is an alternative way to coordinate parallel tasks in Airflow?

I have recently started working with Apache Airflow, and my DAGs and workflow run perfectly. However, I am looking for another way to coordinate the dependencies by excluding a task within the workflow.
The code below produces the following dependency chain:
start >> spark_job >> sql_job >> [getfile, getfile2] >> t2 >> [renamefile, renamefile2] >> t8 >> t9 >> t10 >> end
I am looking for a way to exclude the sleep task (t2) so that getfile connects directly to renamefile2 and getfile2 connects directly to renamefile.

You cannot set dependencies directly between two lists, but you can break the dependencies down to get a direct connection from getfile to renamefile.
I took exactly what you mentioned in the description, but are you sure you want to connect getfile to renamefile2 (and not renamefile)?
start >> spark_job >> sql_job >> [getfile, getfile2]
getfile >> renamefile2 # opposite??
getfile2 >> renamefile # opposite??
[renamefile, renamefile2] >> t8 >> t9 >> t10 >> end
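As a side note: if you ever want every download task to feed every rename task (a full cross product rather than the two specific pairs above), Airflow has a helper for that. A minimal sketch, assuming Airflow 2, where cross_downstream lives in airflow.models.baseoperator:
from airflow.models.baseoperator import cross_downstream

# Wires every task in the first list to every task in the second list:
# getfile >> renamefile, getfile >> renamefile2, getfile2 >> renamefile, getfile2 >> renamefile2
cross_downstream([getfile, getfile2], [renamefile, renamefile2])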

Related

Airflow retry of multiple tasks as a group

I have a group of tasks that should run as a unit, in the sense if any of the tasks from the group fail, the whole group should be marked as failed.
I want to be able to retry the group when it has failed.
For example, I have a DAG with these tasks:
taskA >> (taskB >> taskC) >> taskD
I want to say that (taskB >> taskC) is a group.
If either taskB or taskC fails, I want to be able to rerun the whole group (taskB >> taskC).
This is a two-part question.
First, in Airflow a downstream task cannot affect an upstream task. Assuming a structure of:
taskA >> taskB >> taskC >> taskD
if taskB succeeds and taskC fails, Airflow cannot change the state of taskB to failed.
Second, clearing (rerunning) a TaskGroup as a unit is not currently available; there is an open feature request for it in the Airflow repo.
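Until that feature lands, you can still model the group explicitly and retry it by hand. A minimal sketch, assuming Airflow 2.4+ (TaskGroup and EmptyOperator; the DAG and task names are hypothetical):
from pendulum import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG("group_retry_demo", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    task_a = EmptyOperator(task_id="task_a")
    task_d = EmptyOperator(task_id="task_d")

    # taskB >> taskC modelled as one visible unit in the graph view
    with TaskGroup(group_id="group_bc") as group_bc:
        task_b = EmptyOperator(task_id="task_b")
        task_c = EmptyOperator(task_id="task_c")
        task_b >> task_c

    task_a >> group_bc >> task_d
The "group retry" itself stays manual for now: clearing task_b with the Downstream option (in the UI or via airflow tasks clear) re-queues task_c as well, which is the closest equivalent today.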

Airflow configuration recommendation for LocalExecutor

I am using an Airflow docker-compose setup and I have some performance issues, along with the strange behavior of Airflow crashing.
First, I have 5 DAGs running at the same time; each one of them has 8 steps and max_active_runs=1:
step1x
step2y
step3 >> step4 >> step8
step3 >> step5 >> step8
step3 >> step6 >> step8
step3 >> step7 >> step8
I would like to know what configuration I should use in order to balance Airflow parallelism against stability, i.e. what the maximum recommended value is for each of the options below for a machine that has X CPUs and Y GB of RAM.
I am using the LocalExecutor but can't figure out how I should configure the parallelism:
AIRFLOW__SCHEDULER__SCHEDULER_MAX_THREADS=?
AIRFLOW__CORE__PARALLELISM=?
AIRFLOW__WEBSERVER__WORKERS=?
Is there documentation that states the recommendation for each one of those based on your machine specification?
I'm not sure you have a parallelism problem... yet.
Can you clarify something? Do you have 5 different DAGs with similar set-ups, or is this launching five instances of the same DAG at once? I'd expect the former because of the max_active_runs setting.
On your task declaration here:
step1x
step2y
step3 >> step4 >> step8
step3 >> step5 >> step8
step3 >> step6 >> step8
step3 >> step7 >> step8
Are you expecting step1x, step2y, and step3 to all execute at the same time, then steps 4-7, and finally step8? What are you doing in the DAG that needs that kind of structure rather than running steps 1-8 sequentially?
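As far as I know there is no official table mapping these settings to CPU/RAM, but the defaults are a reasonable baseline to tune from. A hedged sketch (the values shown are the Airflow 1.10-era defaults, not tuned recommendations):
AIRFLOW__CORE__PARALLELISM=32                  # global cap on concurrently running task instances
AIRFLOW__CORE__DAG_CONCURRENCY=16              # cap on concurrent tasks per DAG
AIRFLOW__SCHEDULER__SCHEDULER_MAX_THREADS=2    # DAG-file parsing threads; a few are usually enough
AIRFLOW__WEBSERVER__WORKERS=4                  # gunicorn workers serving the UI; independent of task throughput
With the LocalExecutor every running task is a subprocess of the scheduler, so AIRFLOW__CORE__PARALLELISM is effectively the number of processes competing for your X CPUs and Y GB of RAM; lowering it toward the CPU count is usually the first lever for stability, while the webserver workers only affect the UI.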

Rerun an Airflow DAG from a middle task and continue to the end of all downstream tasks (resume an Airflow DAG from any task)

Hi, I am new to Apache Airflow. I have a DAG of dependencies, let's say:
Task A >> Task B >> Task C >> Task D >> Task E
1. Is it possible to run an Airflow DAG from a middle task, let's say Task C?
2. Is it possible to run only a specific branch in case of a branching operator in the middle?
3. Is it possible to resume an Airflow DAG from the last failed task?
4. If that is not possible, how do you manage large DAGs and avoid rerunning redundant tasks?
Please provide suggestions on how to implement this if possible.
1. You can't do it manually. If you use a BranchPythonOperator, you can skip tasks up to the task you wish to start with, according to the conditions set in the BranchPythonOperator.
2. Same as 1.
3. Yes. You can clear a task's upstream up to the root, or its downstream down to all leaves of the node.
You can do something like:
Task A >> Task B >> Task C >> Task D
Task C >> Task E
where Task C is the branching operator.
For example:
from datetime import date
# Airflow 2 import path; in Airflow 1.10 use airflow.operators.python_operator instead
from airflow.operators.python import BranchPythonOperator

def branch_func():
    # Run Task D on Mondays, Task E on every other day
    if date.today().weekday() == 0:
        return 'Task_D'  # must match Task D's task_id
    else:
        return 'Task_E'  # must match Task E's task_id

Task_C = BranchPythonOperator(
    task_id='branch_operation',
    python_callable=branch_func,
    dag=dag)
This will be the task sequence on Mondays:
Task A >> Task B >> Task C >> Task D
and this will be the task sequence on the rest of the week:
Task A >> Task B >> Task C >> Task E
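For point 3, the manual equivalent of "resume from the last failure" is clearing the failed task together with everything downstream of it, either from the UI or from the CLI. A hedged sketch using Airflow 2 CLI syntax, a hypothetical dag id my_dag, and a hypothetical task id task_c:
# Clear task_c and everything downstream of it for the given run window;
# the scheduler then re-runs task_c and its descendants while upstream tasks keep their states.
airflow tasks clear my_dag --task-regex '^task_c$' --downstream --start-date 2021-01-01 --end-date 2021-01-01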

Can Snakemake work if a rule's shell command is a cluster job?

In the example below, if the shell script shell_script.sh sends a job to the cluster, is it possible to have Snakemake be aware of that cluster job's completion? That is, first, file a should be created by shell_script.sh, which sends its own job to the cluster, and then, once this cluster job is completed, file b should be created.
For simplicity, let's assume that Snakemake is run locally, meaning that the only cluster job originates from shell_script.sh and not from Snakemake itself.
localrules: that_job

rule all:
    input:
        "output_from_shell_script.txt",
        "file_after_cluster_job.txt"

rule that_job:
    output:
        a = "output_from_shell_script.txt",
        b = "file_after_cluster_job.txt"
    shell:
        """
        shell_script.sh {output.a}
        touch {output.b}
        """
PS - At the moment, I am using a sleep command to give it waiting time before the job is "completed", but this is an awful workaround as it could give rise to several problems.
Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use Snakemake on an SGE-managed cluster:
The template that will encapsulate the jobs, which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
Then I run, directly on the login node:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster tells Snakemake which command to use to submit each job to the queuing system
--jobscript is the template in which jobs will be encapsulated
--latency-wait is important if the file system takes a bit of time to write the files. Your job might end and return before the outputs of the rules are actually visible to the filesystem, which will cause an error
Note that in the Snakefile you can specify rules that should not be executed on the cluster nodes with the keyword localrules:
Otherwise, depending on your queuing system, some options exist to wait for jobs sent to the cluster to finish:
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete
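For SGE specifically, the simplest of those options is to make the submission itself block: qsub -sync y does not return until the submitted job finishes, so the rule's shell command naturally waits for the cluster job. A sketch under that assumption (my_cluster_job.sh is a hypothetical script that does the real work):
# Inside shell_script.sh (SGE): submit synchronously so the script only
# returns once the cluster job is done; "$1" carries {output.a} through.
qsub -sync y my_cluster_job.sh "$1"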

Airflow keeps showing example DAGs even after removing them from the configuration

The Airflow example DAGs remain in the UI even after I have set load_examples = False in the config file.
The system reports that the DAGs are not present in the DAG folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to delete these rows directly in the database, but of course that is not ideal. How should I proceed to remove these DAGs from the UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire airflow db.
(Since Airflow 1.10 there is a CLI command to delete a DAG from the database.)
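If there are many example DAGs to remove, the deletion can be scripted. A hedged sketch, assuming the Airflow 1.10 CLI and an illustrative (not exhaustive) list of example dag ids; adjust the list to what your UI shows:
for d in example_bash_operator example_branch_operator example_python_operator; do
    airflow delete_dag -y "$d"    # -y skips the confirmation prompt
done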
Assuming you have installed Airflow through Anaconda (otherwise, look for airflow in your Python site-packages folder and follow the same steps):
After you follow the instructions in https://stackoverflow.com/a/43414326/1823570
Go to the $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags (or just rename it so you can revert later)
Restart your webserver:
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
Running airflow resetdb also works here (note that it resets the entire metadata database).
What I do is create multiple shell scripts for various purposes, like starting the webserver, starting the scheduler, refreshing DAGs, etc. I only need to run a script to do what I want. Here is the list:
(venv) (base) [pchoix@hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix@hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x those scripts.
I hope this helps.
