airflow user impersonation (run_as_user) not working - airflow

I am trying to use run_as_user feature in airflow for our DAG and we are facing some issues. Any help or recommendations?
DAG Code:from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
current_time = datetime.now() - timedelta(days=1)
default_args = {
'start_date': datetime.strptime(current_time.strftime('%Y-%m-%d %H:%M:%S'),'%Y-%m-%d %H:%M:%S'),
'run_as_user': 'airflowaduser',
'execution_timeout': timedelta(minutes=5)
}
dag = DAG('test_run-as_user', default_args=default_args,description='Run hive Query DAG', schedule_interval='0 * * * *',)
hive_ex = BashOperator(
task_id='hive-ex',
bash_command='whoami',
dag=dag
)
i have airflow added to sudoers and it can switch to airflowaduser without password from Linux shell.
airflow ALL=(ALL) NOPASSWD: ALL
Error details below while running the DAG:
*** Reading local file: /home/airflow/logs/test_run-as_user/hive-ex/2020-06-09T16:00:00+00:00/1.log
[2020-06-09 17:00:04,602] {taskinstance.py:620} INFO - Dependencies all met for <TaskInstance: test_run-as_user.hive-ex 2020-06-09T16:00:00+00:00 [queued]>
[2020-06-09 17:00:04,613] {taskinstance.py:620} INFO - Dependencies all met for <TaskInstance: test_run-as_user.hive-ex 2020-06-09T16:00:00+00:00 [queued]>
[2020-06-09 17:00:04,613] {taskinstance.py:838} INFO -
--------------------------------------------------------------------------------
[2020-06-09 17:00:04,613] {taskinstance.py:839} INFO - Starting attempt 1 of 1
[2020-06-09 17:00:04,613] {taskinstance.py:840} INFO -
--------------------------------------------------------------------------------
[2020-06-09 17:00:04,651] {taskinstance.py:859} INFO - Executing <Task(BashOperator): hive-ex> on 2020-06-09T16:00:00+00:00
[2020-06-09 17:00:04,651] {base_task_runner.py:133} INFO - Running: ['sudo', '-E', '-H', '-u', 'airflowaduser', 'airflow', 'run', 'test_run-as_user', 'hive-ex', '2020-06-09T16:00:00+00:00', '--job_id', '2314', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/test_run-as_user/testscript.py', '--cfg_path', '/tmp/tmpbinlgw54']
[2020-06-09 17:00:04,664] {base_task_runner.py:115} INFO - Job 2314: Subtask hive-ex sudo: airflow: command not found
[2020-06-09 17:00:09,576] {logging_mixin.py:95} INFO - [[34m2020-06-09 17:00:09,575[0m] {[34mlocal_task_job.py:[0m105} INFO[0m - Task exited with return code 1[0m
And our airflow runs in virtual environment.

When running airflow in a virtual environment, only the user 'airflow' is configured to run airflow commands. If you want to run as another user, you need to set the home directory to be the same as for the airflow user (/home/airflow) and have it belong to the 0 group. See [https://airflow.apache.org/docs/docker-stack/entrypoint.html#allowing-arbitrary-user-to-run-the-container]
Additionally, the run_as_user feature calls sudo, which is only allowed to use the secure path. The location of the airflow commands is not part of the secure path, but it can be added in the sudoers file. You can use whereis airflow to check where the airflow directory is, in my container it was /home/airflow/.local/bin
To solve this I needed to add 4 lines to my Dockerfile:
RUN useradd -u [airflowaduser UID] -g 0 -d /home/airflow kettle && \
# create airflowaduser
usermod -u [airflow UID] -aG sudo airflow && \
# add airflow to sudo group
echo "airflow ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
# allow airflow to run sudo without a password
sed -i 's#/.venv/bin#/home/airflow/.local/bin:/.venv/bin#' /etc/sudoers
# update secure path to include the airflow directory

Related

Airflow cant execute SparkSubmit

I am trying SparkSubmitOperator in Airflow
My job run by jar file, with config in config.properties by typesafe.ConfigFactory
My error is:
airflow.exceptions.AirflowException: Cannot execute: /home/hungnd/spark-2.4.3-bin-hadoop2.7/bin/spark-submit
--master yarn
--conf spark.executor.extraClassPath=file:///home/hungnd/airflow/dags/project/spark-example/config.properties
--files /home/hungnd/airflow/dags/project/spark-example/config.properties
--driver-class-path file:///home/hungnd/airflow/dags/project/spark-example/config.properties
--jars file:///home/hungnd/airflow/dags/project/spark-example/target/libs/*
--name arrow-spark
--class vn.vccorp.adtech.analytic.Main
--queue root.default
--deploy-mode client
/home/hungnd/airflow/dags/project/spark-example/target/spark-example-1.0-SNAPSHOT.jar
But I copy that command to server ubuntu, it run successfully
Please help me!! Thank

Airflow - Can't backfill via CLI

I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?

Airflow trigger_dag not working from command line

I have the following airflow setup (version v1.10.0).
default_args = {
'owner': 'Google Cloud Datalab',
'email': [],
'start_date': datetime.datetime.strptime('2019-07-04T00:00:00', '%Y-%m-%dT%H:%M:%S'),
'end_date': None,
'depends_on_past': False,
}
dag = DAG(dag_id='my_dag', schedule_interval='00 23 * * *', catchup=False, default_args=default_args)
The dag has been turned ON in the airflow UI. airflow scheduler is running too (ps -ef | grep "airflow scheduler" gives multiple hits).
I am trying to run this dag for a specific date in the past (2 days ago) using command line as
nohup airflow trigger_dag -e 2019-08-27 my_dag &
The command just exits after a few seconds. nohup.out shows the following in the end.
[2019-08-29 22:16:06,144] {cli.py:237} INFO - Created <DagRun my_dag # 2019-08-27 00:00:00+00:00: manual__2019-08-27T00:00:00+00:00, externally triggered: True>
This is a fresh airflow instance and there have been no previous runs of the dag. How can I get it to run for the specific date?

Multiple BashOperator in Airflow doesn't recognize the current folder

I am using Airflow to see if I can do the same work for my data ingestion, original ingestion is completed by two steps in shell:
cd ~/bm3
./bm3.py runjob -p projectid -j jobid
In Airflow, I have two tasks with BashOperator:
task1 = BashOperator(
task_id='switch2BMhome',
bash_command="cd /home/pchoix/bm3",
dag=dag)
task2 = BashOperator(
task_id='kickoff_bm3',
bash_command="./bm3.py runjob -p client1 -j ingestion",
dag=dag)
task1 >> task2
The task1 completed as expected, log below:
[2019-03-01 16:50:17,638] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmpkla8w_xd/switch2ALhomeelbcfbxb
[2019-03-01 16:50:17,638] {bash_operator.py:110} INFO - Running command: cd /home/rxie/al2
the task2 failed for the reason shown in log:
[2019-03-01 16:51:19,896] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmp328cvywu/kickoff_al2710f17lm
[2019-03-01 16:51:19,896] {bash_operator.py:110} INFO - Running command: ./bm32.py runjob -p client1 -j ingestion
[2019-03-01 16:51:19,902] {bash_operator.py:119} INFO - Output:
[2019-03-01 16:51:19,903] {bash_operator.py:123} INFO - /tmp/airflowtmp328cvywu/kickoff_al2710f17lm: line 1: ./bm3.py: No such file or directory
So it seems every task is executed from a seemly unique temp folder, which failed the second task.
How can I run the bash command from specific location?
Any thought is highly appreciated if you can share here.
Thank you very much.
UPDATE:
Thanks for the suggestion which almost works.
The bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion", works fine in the first place, however the runjob has multiple tasks in it, the first task works, and second task invoke impala-shell.py to run something, the impala-shell.py specifies python2 as its interpreter language while outside it, other parts are using python 3.
This is OK when I just run the bash_command in shell, but in Airflow, for unknown reason, despite I set the correct PATH and make sure in shell:
(base) (venv) [pchoix#hadoop02 ~]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
The task is still executed within python 3 and uses python 3, which is seen from the log:
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - File "/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/../lib/impala-shell/impala_shell.py", line 220
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - print '\tNo options available.'
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - ^
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - SyntaxError: Missing parentheses in call to 'print'
Note this issue doesn't exist when I run the job in shell environment:
./bm3.py runjob -p client1 -j ingestion
How about:
task = BashOperator(
task_id='switch2BMhome',
bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion",
dag=dag)

Airflow will keep showing example dags even after removing it from configuration

Airflow example dags remain in the UI even after I have turned off load_examples = False in config file.
The system informs the dags are not present in the dag folder but they remain in UI because the scheduler has marked it as active in the metadata database.
I know one way to remove them from there would be to directly delete these rows in the database but off course this is not ideal.How should I proceed to remove these dags from UI?
There is currently no way of stopping a deleted DAG from being displayed on the UI except manually deleting the corresponding rows in the DB. The only other way is to restart the server after an initdb.
Airflow 1.10+:
Edit airflow.cfg and set load_examples = False
For each example dag run the command airflow delete_dag example_dag_to_delete
This avoids resetting the entire airflow db.
(Since Airflow 1.10 there is the command to delete dag from database, see this answer )
Assuming you have installed airflow through Anaconda.
Else look for airflow in your python site-packages folder and follow the below.
After you follow the instructions https://stackoverflow.com/a/43414326/1823570
Go to $AIRFLOW_HOME/lib/python2.7/site-packages/airflow directory
Remove the directory named example_dags or just rename it to revert back
Restart your webserver
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p [port-number]
Definitely airflow resetdb works here.
What I do is to create multiple shell scripts for various purposes like start webserver, start scheduler, refresh dag, etc. I only need to run the script to do what I want. Here is the list:
(venv) (base) [pchoix#hadoop02 airflow]$ cat refresh_airflow_dags.sh
#!/bin/bash
cd ~
source venv/bin/activate
airflow resetdb
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_scheduler.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow_webserver.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
(venv) (base) [pchoix#hadoop02 airflow]$ cat start_airflow.sh
#!/bin/bash
cd /home/pchoix
source venv/bin/activate
cd airflow
nohup airflow webserver >> "logs/web/$(date +'%Y%m%d%I%M%p').log" &
nohup airflow scheduler >> "logs/schd/$(date +'%Y%m%d%I%M%p').log" &
Don't forget to chmod +x to those scripts
I hope you find this helps.

Resources