Airflow configuration recommendation for LocalExecutor - airflow

I am using the Airflow docker-compose setup from here, and I am seeing performance issues along with Airflow crashing unexpectedly.
I have 5 DAGs running at the same time. Each has 8 steps and max_active_runs=1:
step1x
step2y
step3 >> step4 >> step8
step3 >> step5 >> step8
step3 >> step6 >> step8
step3 >> step7 >> step8
I would like to know what configuration to use to balance Airflow parallelism against stability, i.e. what the maximum recommended values of the options below are for a machine with X CPUs and Y GB of RAM.
I am using the LocalExecutor but can't figure out how to configure the parallelism:
AIRFLOW__SCHEDULER__SCHEDULER_MAX_THREADS=?
AIRFLOW__CORE__PARALLELISM=?
AIRFLOW__WEBSERVER__WORKERS=?
Is there documentation that gives recommendations for each of these settings based on machine specification?

I'm not sure you have a parallelism problem... yet.
Can you clarify something? Do you have 5 different DAGs with similar set-ups, or is this launching five instances of the same task at once? I'd expect the former because of the max_active_runs setting.
In your task declaration here:
step1x
step2y
step3 >> step4 >> step8
step3 >> step5 >> step8
step3 >> step6 >> step8
step3 >> step7 >> step8
Are you expecting step1x, step2y and step3 to all execute at the same time, then steps 4-7, and finally step8? What are you doing in the DAG that needs that kind of process rather than running steps 1-8 sequentially?
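To get a feel for how many task slots this layout could ask for, here is a rough sketch in plain Python (it does not use Airflow). It groups the tasks from the question by dependency depth and counts the widest level. The edge list and NUM_DAGS are transcribed from the question; the "widest level" heuristic is my own assumption, not an Airflow recommendation. Since step1x and step2y have no dependencies, long-running instances of them could overlap with later levels, so treat the number as a starting estimate when choosing AIRFLOW__CORE__PARALLELISM (the setting that caps how many task instances run at once), not as a hard bound.

```python
from collections import Counter, defaultdict
from functools import lru_cache

# Dependencies as declared in the question: (upstream, downstream).
edges = [
    ("step3", "step4"), ("step3", "step5"), ("step3", "step6"), ("step3", "step7"),
    ("step4", "step8"), ("step5", "step8"), ("step6", "step8"), ("step7", "step8"),
]
tasks = ["step1x", "step2y", "step3", "step4", "step5", "step6", "step7", "step8"]
NUM_DAGS = 5  # five DAGs with the same shape, each with max_active_runs=1

# Map every task to the set of tasks it waits on.
upstream = defaultdict(set)
for parent, child in edges:
    upstream[child].add(parent)


@lru_cache(maxsize=None)
def depth(task: str) -> int:
    """Length of the longest dependency chain leading to this task."""
    parents = upstream.get(task, ())
    return 0 if not parents else 1 + max(depth(p) for p in parents)


# Number of tasks at each depth; tasks at the same depth can run together.
width = Counter(depth(t) for t in tasks)
widest_level = max(width.values())

print(dict(width))              # {0: 3, 1: 4, 2: 1}
print(widest_level * NUM_DAGS)  # 20
```

So with all five DAGs hitting steps 4-7 at once you would want roughly 20 slots, and whether the machine can actually carry 20 concurrent task processes depends on what each step does.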

Related

Airflow terminate EMR cluster

I am using an EMR cluster to run two jobs in parallel. Both jobs run in the same cluster. I have set the action_on_failure field to 'CONTINUE' so that if one task fails, the other still runs in the cluster. I want the final task, EMRTerminateCluster, to run after both of these tasks have completed, regardless of success or failure.
         task2
task1 >>       >> task4
         task3
I want my DAG to run in such a way that task4 only starts after task2 and task3.
Is there any way to do this?

Airflow retry of multiple tasks as a group

I have a group of tasks that should run as a unit, in the sense that if any task in the group fails, the whole group should be marked as failed.
I want to be able to retry the group when it has failed.
For example, I have a DAG with these tasks:
taskA >> (taskB >> taskC) >> taskD
I want to say that (taskB >> taskC) is a group.
If either taskB or taskC fails, I want to be able to rerun the whole group (taskB >> taskC).
This is a two-part question.
First, in Airflow a downstream task cannot affect an upstream task. Assuming a structure of:
taskA >> taskB >> taskC >> taskD
then if taskB succeeds and taskC fails, taskC cannot change the state of taskB to failed.
Second, clearing (rerunning) a TaskGroup is a feature that is not currently available. There is an open feature request for it in the Airflow repo. You can view it in this link.

What is an alternative way to coordinate parallel tasks in airflow

I have recently started working with Apache Airflow, and my DAGs and workflows run perfectly. However, I am looking for another way to coordinate the dependencies by excluding a task within the workflow.
The code below produces the following:
start >> spark_job >> sql_job >> [getfile, getfile2] >> t2 >> [renamefile, renamefile2] >> t8 >> t9 >> t10 >> end
I am looking for a way to exclude the sleep task, so that getfile connects directly to renamefile2 and getfile2 connects to renamefile.
You cannot set dependencies between two lists, but you can break the dependencies down to get a direct connection from getfile to a rename task.
I took what you wrote in the description literally, but are you sure you want to connect getfile to renamefile2?
start >> spark_job >> sql_job >> [getfile, getfile2]
getfile >> renamefile2 # opposite??
getfile2 >> renamefile # opposite??
[renamefile, renamefile2] >> t8 >> t9 >> t10 >> end

Multiple BashOperator in Airflow doesn't recognize the current folder

I am using Airflow to see if I can do the same work for my data ingestion. The original ingestion is done in two steps in the shell:
cd ~/bm3
./bm3.py runjob -p projectid -j jobid
In Airflow, I have two tasks with BashOperator:
task1 = BashOperator(
    task_id='switch2BMhome',
    bash_command="cd /home/pchoix/bm3",
    dag=dag)
task2 = BashOperator(
    task_id='kickoff_bm3',
    bash_command="./bm3.py runjob -p client1 -j ingestion",
    dag=dag)
task1 >> task2
task1 completed as expected; log below:
[2019-03-01 16:50:17,638] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmpkla8w_xd/switch2ALhomeelbcfbxb
[2019-03-01 16:50:17,638] {bash_operator.py:110} INFO - Running command: cd /home/rxie/al2
task2 failed for the reason shown in the log:
[2019-03-01 16:51:19,896] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmp328cvywu/kickoff_al2710f17lm
[2019-03-01 16:51:19,896] {bash_operator.py:110} INFO - Running command: ./bm32.py runjob -p client1 -j ingestion
[2019-03-01 16:51:19,902] {bash_operator.py:119} INFO - Output:
[2019-03-01 16:51:19,903] {bash_operator.py:123} INFO - /tmp/airflowtmp328cvywu/kickoff_al2710f17lm: line 1: ./bm3.py: No such file or directory
So it seems every task is executed from a seemingly unique temp folder, which makes the second task fail.
How can I run the bash command from a specific location?
Any thoughts you can share would be highly appreciated.
Thank you very much.
UPDATE:
Thanks for the suggestion, which almost works.
bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion" works at first. However, runjob has multiple tasks in it: the first task works, and the second task invokes impala-shell.py to run something. impala-shell.py specifies python2 as its interpreter, while everything outside it uses python 3.
This is fine when I run the bash_command in a shell. In Airflow, however, for an unknown reason, even though I set the correct PATH and verified it in the shell:
(base) (venv) [pchoix#hadoop02 ~]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
the task is still executed with python 3, as seen in the log:
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - File "/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/../lib/impala-shell/impala_shell.py", line 220
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - print '\tNo options available.'
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - ^
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - SyntaxError: Missing parentheses in call to 'print'
Note that this issue doesn't exist when I run the job in a shell environment:
./bm3.py runjob -p client1 -j ingestion
How about:
task = BashOperator(
    task_id='switch2BMhome',
    bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion",
    dag=dag)
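To see why chaining with && helps, here is a small sketch in plain Python (no Airflow involved). It mimics what the logs in the question suggest: each command runs in its own bash process, started from its own temporary directory. The helper name run_in_fresh_shell is made up for illustration, and the system temp directory stands in for /home/pchoix/bm3.

```python
import os
import shlex
import subprocess
import tempfile


def run_in_fresh_shell(command: str) -> str:
    """Run a command in a new bash process started from a new temp directory."""
    with tempfile.TemporaryDirectory() as scratch:
        result = subprocess.run(
            ["bash", "-c", command],
            cwd=scratch,
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout.strip()


# Stand-in for the project folder; resolved so it compares cleanly with `pwd -P`.
target = os.path.realpath(tempfile.gettempdir())
quoted = shlex.quote(target)

# Two separate "tasks": the cd in the first one is gone by the time the second runs.
run_in_fresh_shell(f"cd {quoted}")
separate = run_in_fresh_shell("pwd -P")

# One "task": cd and the follow-up command share the same shell process.
chained = run_in_fresh_shell(f"cd {quoted} && pwd -P")

print(separate == target)  # False
print(chained == target)   # True
```

A cd only changes the working directory of the process it runs in, so the second task never sees it. Putting both commands in one bash_command keeps them in the same shell.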

Can Snakemake work if a rule's shell command is a cluster job?

In the example below, if the shell script shell_script.sh sends a job to the cluster, is it possible to make snakemake aware of that cluster job's completion? That is, file a should first be created by shell_script.sh, which sends its own job to the cluster, and then, once this cluster job is completed, file b should be created.
For simplicity, let's assume that snakemake is run locally, meaning that the only cluster job is the one submitted by shell_script.sh and not by snakemake.
localrules: that_job
rule all:
    input:
        "output_from_shell_script.txt",
        "file_after_cluster_job.txt"
rule that_job:
    output:
        a = "output_from_shell_script.txt",
        b = "file_after_cluster_job.txt"
    shell:
        """
        shell_script.sh {output.a}
        touch {output.b}
        """
PS - At the moment, I am using a sleep command to wait before the job is considered "completed". But this is an awful workaround, as it could cause several problems.
Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use snakemake on an SGE-managed cluster.
The template that wraps the jobs, which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
Then I run this directly on the login node:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster specifies the command used to submit jobs to the queuing system
--jobscript is the template in which jobs will be wrapped
--latency-wait is important if the file system takes a bit of time to write the files. Your job might end and return before the outputs of the rules are actually visible on the filesystem, which would cause an error
Note that you can specify rules that should not be executed on the nodes in the Snakefile with the keyword localrules:
Otherwise, depending on your queuing system, some options exist to wait for a job sent to the cluster to finish:
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete
