bitnami/airflow running kubectl command - airflow

How can I create a DAG to run the command kubectl version?
DAG file
...
bash_task = BashOperator(
    task_id="test_kubectl",
    bash_command="kubectl version",
)
...
The logs of the Airflow run show:
/bin/bash: line 1: kubectl: command not found
I am not sure if this is the right way to run the kubectl command.
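The error means the Bitnami Airflow images do not ship a kubectl binary, so the BashOperator has nothing to call. One possible workaround, sketched below under assumptions (the apache-airflow-providers-cncf-kubernetes package is installed, the pod's service account is allowed to query the API server, and the DAG/task names and namespace are made up), is to run the command in its own pod using an image that does contain kubectl, such as bitnami/kubectl:
from datetime import datetime
from airflow import DAG
# Newer provider versions import this from airflow.providers.cncf.kubernetes.operators.pod instead.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("test_kubectl", start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False) as dag:
    kubectl_version = KubernetesPodOperator(
        task_id="kubectl_version",
        name="kubectl-version",
        namespace="default",              # assumption: adjust to your namespace
        image="bitnami/kubectl:latest",   # image that ships kubectl
        cmds=["kubectl"],
        arguments=["version"],
        get_logs=True,
    )
Alternatively, baking kubectl into the worker image (or mounting the binary into it) would let the BashOperator version work as written.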

Related

Airflow can't execute SparkSubmit

I am trying the SparkSubmitOperator in Airflow.
My job runs from a jar file, with its configuration in config.properties loaded via typesafe.ConfigFactory.
My error is:
airflow.exceptions.AirflowException: Cannot execute: /home/hungnd/spark-2.4.3-bin-hadoop2.7/bin/spark-submit
--master yarn
--conf spark.executor.extraClassPath=file:///home/hungnd/airflow/dags/project/spark-example/config.properties
--files /home/hungnd/airflow/dags/project/spark-example/config.properties
--driver-class-path file:///home/hungnd/airflow/dags/project/spark-example/config.properties
--jars file:///home/hungnd/airflow/dags/project/spark-example/target/libs/*
--name arrow-spark
--class vn.vccorp.adtech.analytic.Main
--queue root.default
--deploy-mode client
/home/hungnd/airflow/dags/project/spark-example/target/spark-example-1.0-SNAPSHOT.jar
But when I copy that command to the Ubuntu server and run it there, it runs successfully.
Please help me! Thanks.
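The exception usually just means the Airflow worker could not execute that spark-submit path (different user, missing permission, or a different environment than your interactive shell), even though the same command works when you run it by hand. As a sketch only, with the connection id taken as an assumption and the paths copied from the command above, the equivalent operator configuration would look roughly like this, with the binary location made explicit:
# Older 1.x installs import this from airflow.contrib.operators.spark_submit_operator instead.
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id="spark_example",
    application="/home/hungnd/airflow/dags/project/spark-example/target/spark-example-1.0-SNAPSHOT.jar",
    java_class="vn.vccorp.adtech.analytic.Main",
    name="arrow-spark",
    conn_id="spark_default",  # assumption: the Spark connection carries master=yarn, deploy-mode and queue
    conf={"spark.executor.extraClassPath": "file:///home/hungnd/airflow/dags/project/spark-example/config.properties"},
    files="/home/hungnd/airflow/dags/project/spark-example/config.properties",
    driver_class_path="file:///home/hungnd/airflow/dags/project/spark-example/config.properties",
    jars="file:///home/hungnd/airflow/dags/project/spark-example/target/libs/*",
    spark_binary="/home/hungnd/spark-2.4.3-bin-hadoop2.7/bin/spark-submit",  # the airflow user must be able to execute this
)
If it still fails, checking that the user the worker runs as can execute that binary and sees the same Hadoop/YARN configuration as your interactive session is usually the next step.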

Airflow - Can't backfill via CLI

I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
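Filling up the DagBag is how the CLI parses DAG definitions before the backfill starts, and by default it walks the entire DAGs folder, which is why unrelated files (and stale __pycache__ entries) get touched. As a hedged suggestion rather than a guaranteed fix: pointing the command at just the file that defines your DAG with --subdir limits parsing to that file, and clearing stale __pycache__ directories under /opt/airflow/dags may help with the FileNotFoundError. The file name below is an assumption:
airflow dags backfill -s 2021-08-06 -e 2021-08-31 --subdir /opt/airflow/dags/my_dag.py my_dag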

Multiple BashOperators in Airflow don't recognize the current folder

I am using Airflow to see if I can do the same work for my data ingestion; the original ingestion is completed in two steps in a shell:
cd ~/bm3
./bm3.py runjob -p projectid -j jobid
In Airflow, I have two tasks with BashOperator:
task1 = BashOperator(
    task_id='switch2BMhome',
    bash_command="cd /home/pchoix/bm3",
    dag=dag)

task2 = BashOperator(
    task_id='kickoff_bm3',
    bash_command="./bm3.py runjob -p client1 -j ingestion",
    dag=dag)
task1 >> task2
task1 completed as expected, log below:
[2019-03-01 16:50:17,638] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmpkla8w_xd/switch2ALhomeelbcfbxb
[2019-03-01 16:50:17,638] {bash_operator.py:110} INFO - Running command: cd /home/rxie/al2
task2 failed for the reason shown in the log:
[2019-03-01 16:51:19,896] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmp328cvywu/kickoff_al2710f17lm
[2019-03-01 16:51:19,896] {bash_operator.py:110} INFO - Running command: ./bm32.py runjob -p client1 -j ingestion
[2019-03-01 16:51:19,902] {bash_operator.py:119} INFO - Output:
[2019-03-01 16:51:19,903] {bash_operator.py:123} INFO - /tmp/airflowtmp328cvywu/kickoff_al2710f17lm: line 1: ./bm3.py: No such file or directory
So it seems every task is executed from a seemingly unique temp folder, which caused the second task to fail.
How can I run the bash command from a specific location?
Any thoughts are highly appreciated.
Thank you very much.
UPDATE:
Thanks for the suggestion, which almost works.
The bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion" works fine at first. However, runjob contains multiple tasks: the first task works, but the second task invokes impala-shell.py, which specifies Python 2 as its interpreter, while everything outside it uses Python 3.
This is OK when I just run the bash_command in a shell, but in Airflow, for some unknown reason, even though I set the correct PATH and confirm it in the shell:
(base) (venv) [pchoix#hadoop02 ~]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
The task is still executed with Python 3, as seen from the log:
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - File "/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/../lib/impala-shell/impala_shell.py", line 220
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - print '\tNo options available.'
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - ^
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - SyntaxError: Missing parentheses in call to 'print'
Note this issue doesn't exist when I run the job in the shell environment:
./bm3.py runjob -p client1 -j ingestion
How about:
task = BashOperator(
    task_id='switch2BMhome',
    bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion",
    dag=dag)
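On the Python 2 vs Python 3 issue from the UPDATE: the task runs with the worker's environment, not your interactive shell's, so the PATH you verified at the prompt is not necessarily the PATH the BashOperator sees. A sketch under that assumption (the /usr/bin location of the Python 2 interpreter is a guess about your host) is to pin PATH inside the command itself:
from airflow.operators.bash_operator import BashOperator  # Airflow 2.x: from airflow.operators.bash import BashOperator

task2 = BashOperator(
    task_id='kickoff_bm3',
    bash_command=(
        "export PATH=/usr/bin:$PATH && "   # put the directory holding python2 first; adjust to your host
        "cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion"
    ),
    dag=dag)
BashOperator also accepts an env= dict, but note that when env is given the task no longer inherits the worker's environment, so every variable the job needs would have to be listed there.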

Airflow keeps showing example DAGs even after removing them from the configuration

Airflow example DAGs remain in the UI even after I have set load_examples = False in the config file.
The system reports that the DAGs are not present in the DAG folder, but they remain in the UI because the scheduler has marked them as active in the metadata database.
I know one way to remove them would be to delete these rows directly from the database, but of course this is not ideal. How should I proceed to remove these DAGs from the UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire airflow db.
(Since Airflow 1.10 there is a command to delete a DAG from the database; see this answer.)
This assumes you have installed Airflow through Anaconda; otherwise, look for airflow in your Python site-packages folder and follow the same steps.
After you follow the instructions at https://stackoverflow.com/a/43414326/1823570:
Go to the $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags, or just rename it so you can revert later
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
airflow resetdb definitely works here.
What I do is create multiple shell scripts for various purposes, like starting the webserver, starting the scheduler, refreshing DAGs, etc. I only need to run a script to do what I want. Here is the list:
(venv) (base) [pchoix#hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x those scripts.
I hope you find this helpful.

How does deployment work with Airflow?

I'm using the Celery Executor and the setup from this dockerfile.
I'm deploying my dag into /usr/local/airflow/dags directory into the scheduler's container.
I'm able to run my dag with the command:
$ docker exec airflow_webserver_1 airflow backfill mydag -s 2016-01-01 -e 2016-02-01
My dag contains a simple bash operator:
BashOperator(bash_command="test.sh", ... )
The operator runs the test.sh script.
However, if test.sh refers to other files, like callme.sh, then I receive a "cannot find file" error.
e.g.:
$ pwd
/usr/local/airflow/dags/myworkflow.py
$ ls
myworkflow.py
test.sh
callme.sh
$ cat test.sh
echo "test file"
./callme.sh
$ cat callme.sh
echo "got called"
When running myworkflow, the task that calls test.sh is invoked but fails because it cannot find callme.sh.
I find this confusing. Is it my responsibility to share the resource files with the workers, or Airflow's? If it's mine, what is the recommended approach? I'm looking at using EFS mounted on all the containers, but that looks very expensive to me.
For the Celery executor, it is your responsibility to make sure that each worker has all the files it needs to run a job.
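Beyond distributing the files, note that the BashOperator runs its command from a temporary working directory, so a relative ./callme.sh inside test.sh resolves against that temp directory rather than the dags folder. A small sketch, reusing the /usr/local/airflow/dags path from the question and a made-up task_id, is to cd into the folder that actually holds the scripts (or use absolute paths inside test.sh):
from airflow.operators.bash_operator import BashOperator  # Airflow 2.x: from airflow.operators.bash import BashOperator

run_test = BashOperator(
    task_id="run_test_sh",                                   # hypothetical task_id
    bash_command="cd /usr/local/airflow/dags && ./test.sh",  # this path must exist on every Celery worker
    dag=dag,
)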
