How deployment works with Airflow?

How deployment works with Airflow? - airflow

I'm using the Celery Executor and the setup from this dockerfile.
I'm deploying my dag into /usr/local/airflow/dags directory into the scheduler's container.
I'm able to run my dag with the command:
$ docker exec airflow_webserver_1 airflow backfill mydag -s 2016-01-01 -e 2016-02-01
My dag contains a simple bash operator:
BashOperator(command = "test.sh" ... )
The operator runs the test.sh script.
However if the test.sh refers to other files, like callme.sh, then I receive a "cannot find file" error.
e.g
$ pwd
/usr/local/airflow/dags/myworkflow.py
$ ls
myworkflow.py
test.sh
callme.sh
$ cat test.sh
echo "test file"
./callme.sh
$ cat callme.sh
echo "got called"
When running myworkflow, the task to call test.sh is invoked but fails for not finding the callme.sh.
I find this confusing. Is it my responsibility to share the code resource files with the worker or airflow's responsibility? If it's mine, then what is the recommended approach to do so? I'm looking at using EFS with it mounted on all container but it looks very expensive to me.

For celery executor, it is your responsibility to make sure that each worker has all the files it needs to run a job.

Related

How to run an existing shell script using airflow?

I wanted to run some existing bash scripts using airflow without modifying the code in the script itself. Is it possible without mentioning the shell commands in the script in a task?

Not entirely sure if understood your question, but you can load your shell commands into
Variables through Admin >> Variables menu as a json file.
And in your dag read the variable and pass as parameter into the BashOperator.
Airflow variables in more detail:
https://www.applydatascience.com/airflow/airflow-variables/
Example of variables file:
https://github.com/tuanavu/airflow-tutorial/blob/v0.7/examples/intro-example/dags/config/example_variables.json
How to read the variable:
https://github.com/tuanavu/airflow-tutorial/blob/v0.7/examples/intro-example/dags/example_variables.py
I hope this post helps you.

As long as the shell script is on the same machine that the Airflow Worker is running you can just call the shell script using the Bash Operator like the following:
t2 = BashOperator(
task_id='bash_example',
# Just call the script
bash_command="/home/batcher/test.sh ",
dag=dag)

you have to 'link' your local folder where your shell script is with worker, which means that you need to add volume in worker part of your docker-compose file..
so I added volume line under worker settings and worker now looks at this folder on your local machine:
airflow-worker:
<<: *airflow-common
command: celery worker
healthcheck:
test:
- "CMD-SHELL"
- 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery#$${HOSTNAME}"'
interval: 10s
timeout: 10s
retries: 5
restart: always
volumes:
- /LOCAL_MACHINE_FOLDER/WHERE_SHELL_SCRIPT_IS:/folder_in_root_folder_of_worker

Multiple BashOperator in Airflow doesn't recognize the current folder

I am using Airflow to see if I can do the same work for my data ingestion, original ingestion is completed by two steps in shell:
cd ~/bm3
./bm3.py runjob -p projectid -j jobid
In Airflow, I have two tasks with BashOperator:
task1 = BashOperator(
task_id='switch2BMhome',
bash_command="cd /home/pchoix/bm3",
dag=dag)
task2 = BashOperator(
task_id='kickoff_bm3',
bash_command="./bm3.py runjob -p client1 -j ingestion",
dag=dag)
task1 >> task2
The task1 completed as expected, log below:
[2019-03-01 16:50:17,638] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmpkla8w_xd/switch2ALhomeelbcfbxb
[2019-03-01 16:50:17,638] {bash_operator.py:110} INFO - Running command: cd /home/rxie/al2
the task2 failed for the reason shown in log:
[2019-03-01 16:51:19,896] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmp328cvywu/kickoff_al2710f17lm
[2019-03-01 16:51:19,896] {bash_operator.py:110} INFO - Running command: ./bm32.py runjob -p client1 -j ingestion
[2019-03-01 16:51:19,902] {bash_operator.py:119} INFO - Output:
[2019-03-01 16:51:19,903] {bash_operator.py:123} INFO - /tmp/airflowtmp328cvywu/kickoff_al2710f17lm: line 1: ./bm3.py: No such file or directory
So it seems every task is executed from a seemly unique temp folder, which failed the second task.
How can I run the bash command from specific location?
Any thought is highly appreciated if you can share here.
Thank you very much.
UPDATE:
Thanks for the suggestion which almost works.
The bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion", works fine in the first place, however the runjob has multiple tasks in it, the first task works, and second task invoke impala-shell.py to run something, the impala-shell.py specifies python2 as its interpreter language while outside it, other parts are using python 3.
This is OK when I just run the bash_command in shell, but in Airflow, for unknown reason, despite I set the correct PATH and make sure in shell:
(base) (venv) [pchoix#hadoop02 ~]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
The task is still executed within python 3 and uses python 3, which is seen from the log:
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - File "/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/../lib/impala-shell/impala_shell.py", line 220
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - print '\tNo options available.'
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - ^
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - SyntaxError: Missing parentheses in call to 'print'
Note this issue doesn't exist when I run the job in shell environment:
./bm3.py runjob -p client1 -j ingestion

How about:
task = BashOperator(
task_id='switch2BMhome',
bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion",
dag=dag)

how to clear failing DAGs using the CLI in airflow

I have some failing DAGs, let's say from 1st-Feb to 20th-Feb. From that date upword, all of them succeeded.
I tried to use the cli (instead of doing it twenty times with the Web UI):
airflow clear -f -t * my_dags.my_dag_id
But I have a weird error:
airflow: error: unrecognized arguments: airflow-webserver.pid airflow.cfg airflow_variables.json my_dags.my_dag_id
EDIT 1:
Like #tobi6 explained it, the * was indeed causing troubles.
Knowing that, I tried this command instead:
airflow clear -u -d -f -t ".*" my_dags.my_dag_id
but it's only returning failed task instances (-f flag). -d and -u flags don't seem to work because taskinstances downstream and upstream the failed ones are ignored (not returned).
EDIT 2:
like #tobi6 suggested, using -s and -e permits to select all DAG runs within a date range. Here is the command:
airflow clear -s "2018-04-01 00:00:00" -e "2018-04-01 00:00:00" my_dags.my_dag_id.
However, adding -f flag to the command above only returns failed task instances. is it possible to select all failed task instances of all failed DAG runs within a date range ?

If you are using an asterik * in the Linux bash, it will automatically expand the content of the directory.
Meaning it will replace the asterik with all files in the current working directory and then execute your command.
This will help to avoid the automatic expansion:
"airflow clear -f -t * my_dags.my_dag_id"

One solution I've found so far is by executing sql(MySQL in my case):
update task_instance t left join dag_run d on d.dag_id = t.dag_id and d.execution_date = t.execution_date
set t.state=null,
d.state='running'
where t.dag_id = '<your_dag_id'
and t.execution_date > '2020-08-07 23:00:00'
and d.state='failed';
It will clear all tasks states on failed dag_runs, as button 'clear' pressed for entire dag run in web UI.

In airflow 2.2.4 the airflow clear command was deprecated.
You could now run:
airflow tasks clear -s <your_start_date> -e <end_date> <dag_id>

Can Snakemake work if a rule's shell command is a cluster job?

In below example, if shell script shell_script.sh sends a job to cluster, is it possible to have snakemake aware of that cluster job's completion? That is, first, file a should be created by shell_script.sh which sends its own job to the cluster, and then once this cluster job is completed, file b should be created.
For simplicity, let's assume that snakemake is run locally meaning that the only cluster job originating is from shell_script.sh and not by snakemake .
localrules: that_job
rule all:
input:
"output_from_shell_script.txt",
"file_after_cluster_job.txt"
rule that_job:
output:
a = "output_from_shell_script.txt",
b = "file_after_cluster_job.txt"
shell:
"""
shell_script.sh {output.a}
touch {output.b}
"""
PS - At the moment, I am using sleep command to give it a waiting time before the job is "completed". But this is an awful workaround as this could give rise to several problems.

Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use snakemake on a SGE managed cluster:
template which will encapsulate the jobs which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
then I use directly on the login node:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster will tell which queuing system to use
--jobscript is the template in which jobs will be encapsulated
--latency-wait is important if the file system takes a bit of time to write the files. You job might end and return before the output of the rules are actually visible to the filesystem which will cause an error
Note that you can specify rules not to be executed on the nodes in the Snakefile with the keyword localrules:
Otherwise, depending on your queuing system, some options exist to wait for job sent to cluster to finish:
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete

Airflow will keep showing example dags even after removing it from configuration

Airflow example dags remain in the UI even after I have turned off load_examples = False in config file.
The system informs the dags are not present in the dag folder but they remain in UI because the scheduler has marked it as active in the metadata database.
I know one way to remove them from there would be to directly delete these rows in the database but off course this is not ideal.How should I proceed to remove these dags from UI?

There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.

Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire airflow db.
(Since Airflow 1.10 there is the command to delete dag from database, see this answer )

Assuming you have installed airflow through Anaconda.
Else look for airflow in your python site-packages folder and follow the below.
After you follow the instructions https://stackoverflow.com/a/43414326/1823570
Go to $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags or just rename it to revert back
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]

Definitely airflow resetdb works here.
What I do is to create multiple shell scripts for various purposes like start webserver, start scheduler, refresh dag, etc. I only need to run the script to do what I want. Here is the list:
(venv) (base) [pchoix#hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x to those scripts
I hope you find this helps.