Airflow trigger_dag not working from command line

I have the following Airflow setup (version 1.10.0).
import datetime

from airflow import DAG

default_args = {
    'owner': 'Google Cloud Datalab',
    'email': [],
    'start_date': datetime.datetime.strptime('2019-07-04T00:00:00', '%Y-%m-%dT%H:%M:%S'),
    'end_date': None,
    'depends_on_past': False,
}
dag = DAG(dag_id='my_dag', schedule_interval='00 23 * * *', catchup=False, default_args=default_args)
The DAG has been turned ON in the Airflow UI, and the Airflow scheduler is running (ps -ef | grep "airflow scheduler" gives multiple hits).
I am trying to run this DAG for a specific date in the past (2 days ago) from the command line:
nohup airflow trigger_dag -e 2019-08-27 my_dag &
The command just exits after a few seconds, and nohup.out ends with the following:
[2019-08-29 22:16:06,144] {cli.py:237} INFO - Created <DagRun my_dag @ 2019-08-27 00:00:00+00:00: manual__2019-08-27T00:00:00+00:00, externally triggered: True>
This is a fresh Airflow instance and there have been no previous runs of the DAG. How can I get it to run for the specific date?
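For comparison, a minimal sketch of another way to run a DAG for one specific past date in Airflow 1.10 is the backfill CLI rather than trigger_dag (shown only as an illustration, not as a confirmed fix for this setup):
airflow backfill -s 2019-08-27 -e 2019-08-27 my_dag
trigger_dag only creates the DagRun record (as the log line above shows) and then relies on the scheduler to pick it up, whereas backfill creates and executes the runs from the CLI process itself.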

Related

Airflow - Can't backfill via CLI

I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill:
airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
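As a hedged sketch of two things that are sometimes tried in this situation (the DAG file path below is an assumption, since the real filename is not shown): clear the stale bytecode the DagBag loader is tripping over, and point the backfill at the single file that defines the DAG so it does not have to parse the whole folder:
rm -rf /opt/airflow/dags/__pycache__
airflow dags backfill -s 2021-08-06 -e 2021-08-31 --subdir /opt/airflow/dags/my_dag.py my_dag
The backfill command still has to load the DAG definition itself before it can create any runs, which is why it fills the DagBag at all; --subdir just narrows what gets parsed.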

airflow user impersonation (run_as_user) not working

I am trying to use the run_as_user feature in Airflow for our DAG and we are facing some issues. Any help or recommendations?
DAG Code:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

current_time = datetime.now() - timedelta(days=1)

default_args = {
    'start_date': datetime.strptime(current_time.strftime('%Y-%m-%d %H:%M:%S'), '%Y-%m-%d %H:%M:%S'),
    'run_as_user': 'airflowaduser',
    'execution_timeout': timedelta(minutes=5)
}

dag = DAG('test_run-as_user', default_args=default_args, description='Run hive Query DAG', schedule_interval='0 * * * *')

hive_ex = BashOperator(
    task_id='hive-ex',
    bash_command='whoami',
    dag=dag
)
I have the airflow user added to sudoers and it can switch to airflowaduser without a password from a Linux shell.
airflow ALL=(ALL) NOPASSWD: ALL
Error details below while running the DAG:
*** Reading local file: /home/airflow/logs/test_run-as_user/hive-ex/2020-06-09T16:00:00+00:00/1.log
[2020-06-09 17:00:04,602] {taskinstance.py:620} INFO - Dependencies all met for <TaskInstance: test_run-as_user.hive-ex 2020-06-09T16:00:00+00:00 [queued]>
[2020-06-09 17:00:04,613] {taskinstance.py:620} INFO - Dependencies all met for <TaskInstance: test_run-as_user.hive-ex 2020-06-09T16:00:00+00:00 [queued]>
[2020-06-09 17:00:04,613] {taskinstance.py:838} INFO -
--------------------------------------------------------------------------------
[2020-06-09 17:00:04,613] {taskinstance.py:839} INFO - Starting attempt 1 of 1
[2020-06-09 17:00:04,613] {taskinstance.py:840} INFO -
--------------------------------------------------------------------------------
[2020-06-09 17:00:04,651] {taskinstance.py:859} INFO - Executing <Task(BashOperator): hive-ex> on 2020-06-09T16:00:00+00:00
[2020-06-09 17:00:04,651] {base_task_runner.py:133} INFO - Running: ['sudo', '-E', '-H', '-u', 'airflowaduser', 'airflow', 'run', 'test_run-as_user', 'hive-ex', '2020-06-09T16:00:00+00:00', '--job_id', '2314', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/test_run-as_user/testscript.py', '--cfg_path', '/tmp/tmpbinlgw54']
[2020-06-09 17:00:04,664] {base_task_runner.py:115} INFO - Job 2314: Subtask hive-ex sudo: airflow: command not found
[2020-06-09 17:00:09,576] {logging_mixin.py:95} INFO - [2020-06-09 17:00:09,575] {local_task_job.py:105} INFO - Task exited with return code 1
And our Airflow runs in a virtual environment.
When running Airflow in a virtual environment, only the user 'airflow' is configured to run airflow commands. If you want to run as another user, you need to set the home directory to be the same as for the airflow user (/home/airflow) and have it belong to the 0 group. See https://airflow.apache.org/docs/docker-stack/entrypoint.html#allowing-arbitrary-user-to-run-the-container
Additionally, the run_as_user feature calls sudo, which is only allowed to use the secure path. The location of the airflow executable is not part of the secure path, but it can be added in the sudoers file. You can use whereis airflow to check where the airflow executable lives; in my container it was /home/airflow/.local/bin.
To solve this I needed to add 4 lines to my Dockerfile:
RUN useradd -u [airflowaduser UID] -g 0 -d /home/airflow airflowaduser && \
    # create airflowaduser
    usermod -u [airflow UID] -aG sudo airflow && \
    # add airflow to the sudo group
    echo "airflow ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
    # allow airflow to run sudo without a password
    sed -i 's#/.venv/bin#/home/airflow/.local/bin:/.venv/bin#' /etc/sudoers
    # update secure_path to include the airflow directory
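A quick hedged check after rebuilding the image (using the same usernames as above) is to confirm that sudo can now resolve the airflow executable for the impersonated user:
sudo -E -H -u airflowaduser airflow version
If secure_path includes /home/airflow/.local/bin, this prints the Airflow version instead of sudo: airflow: command not found.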

Is there any alternative to running the "airflow scheduler" command in a terminal, so that scheduled DAGs can run on a VM without my being connected?

I am using Airflow on an Azure VM. I would like my DAG to run every day at midnight but I cannot always be connected to the VM at this time to run the "airflow scheduler" command. I would like my DAG to be able to run at midnight when the VM is on without my intervention.
I have tried to use the "run command" from the Azure VM manager but the run is limited to 90 minutes. I have also tried to explore the airflow.cfg file but I did not find anything.
Here are the configs of my DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 7, 29),
    'retries': 1,
    'email': ['n.haack#live.fr'],
    'retry_delay': timedelta(minutes=2)
}

dag = DAG(dag_id=DAG_ID,
          default_args=default_args,
          schedule_interval=timedelta(days=1))
I wonder if there is an alternative to the "airflow scheduler" command for running DAGs, or a way to run a DAG on Airflow with only the server on.
EDIT
There is a solution, which is to run a bootstrap script with "airflow scheduler --daemon". This way, the command will run every time you start your machine, and the daemonization will keep it running until shutdown.
Thank you @Chengzhi
Elements of response here
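As a hedged sketch of one way to keep the scheduler running without an interactive session (the virtualenv path below is an assumption), a crontab entry for the Airflow user can start it at every boot:
@reboot /home/airflow/airflow_venv/bin/airflow scheduler --daemon
A systemd service is another common option and has the advantage of restarting the scheduler automatically if it crashes.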

For Apache Airflow, how can I pass parameters when manually triggering a DAG via the CLI?

I use Airflow to manage ETL task execution and scheduling. A DAG has been created and it works fine. But is it possible to pass parameters when manually triggering the DAG via the CLI?
For example:
My DAG runs every day at 01:30 and processes data for yesterday (time range from 01:30 yesterday to 01:30 today). There might be some issues with the data source, and I then need to re-process that data (manually specifying the time range).
So can I create an Airflow DAG where the default time range, when it is scheduled, is from 01:30 yesterday to 01:30 today, and then, if anything is wrong with the data source, manually trigger the DAG and pass the time range as parameters?
As far as I know, airflow test has a -tp option that can pass params to the task, but this is only for testing a specific task, and airflow trigger_dag doesn't have a -tp option. So is there any way to trigger_dag and pass parameters to the DAG so that the operator can read them?
Thanks!
You can pass parameters from the CLI using --conf '{"key":"value"}' and then use it in the DAG file as "{{ dag_run.conf["key"] }}" in a templated field.
CLI:
airflow trigger_dag 'example_dag_conf' -r 'run_id' --conf '{"message":"value"}'
DAG File:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
)

def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".format(
        kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag,
)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}"',
    dag=dag,
)
This should work, as per the airflow documentation: https://airflow.apache.org/cli.html#trigger_dag
airflow trigger_dag -c '{"key1":1, "key2":2}' dag_id
Make sure the value of -c is a valid JSON string, so the double quotes wrapping the keys are necessary here.
If the conf value you pass for a key is a list, for example:
key: ['param1=somevalue1', 'param2=somevalue2']
there are two ways to read it back.
First way:
"{{ dag_run.conf["key"] }}"
This will render the passed value as the string "['param1=somevalue1', 'param2=somevalue2']".
Second way:
def get_parameters(self, **kwargs):
    dag_run = kwargs.get('dag_run')
    parameters = dag_run.conf['key']
    return parameters
In this scenario, the list of strings is passed through and comes back as an actual list: ['param1=somevalue1', 'param2=somevalue2'].
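Tying this back to the original question, here is a minimal sketch of a callable that uses the scheduled window by default and an explicitly passed range when one is supplied via --conf (the task name, conf keys, and window logic are illustrative, not taken from the question's code):
from datetime import timedelta

from airflow.operators.python_operator import PythonOperator

def process_window(ds, **kwargs):
    # dag_run.conf is empty for scheduled runs and holds the JSON passed with:
    # airflow trigger_dag -c '{"start": "...", "end": "..."}' my_dag
    dag_run = kwargs.get('dag_run')
    conf = (dag_run.conf or {}) if dag_run else {}
    execution_date = kwargs['execution_date']
    start = conf.get('start', (execution_date - timedelta(days=1)).isoformat())
    end = conf.get('end', execution_date.isoformat())
    print("Processing data from {} to {}".format(start, end))

process = PythonOperator(
    task_id='process_window',
    provide_context=True,
    python_callable=process_window,
    dag=dag,  # assumes the dag object defined earlier in this answer
)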

How to exit with error from script to Airflow?

Say I'm running:
t = BashOperator(
    task_id='import',
    bash_command="""python3 script.py '{{ next_execution_date }}' """,
    dag=dag)
And for some reason I want the script to exit with an error and indicate to Airflow that it should retry this task.
I tried using os._exit(1), but Airflow marks the task as a success.
I know there is:
from airflow.exceptions import AirflowException
raise AirflowException("error msg")
But this is more for functions written inside a DAG. My script is independent, and sometimes we run it on its own, regardless of Airflow.
Also, the script is Python 3 while Airflow runs under Python 2.7.
It seems excessive to install Airflow on Python 3 just for error handling.
Is there any other solution?
Add || exit 1 at the end of your Bash command:
bash_command="""python3 script.py '{{ next_execution_date }}' || exit 1 """
More information: https://unix.stackexchange.com/a/309344
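For completeness, a minimal sketch of the script side, assuming the goal is simply that script.py returns a non-zero exit status for the BashOperator to pick up (the body below is illustrative, not the question's actual script):
import sys

def main():
    try:
        pass  # ... the actual import/processing work goes here ...
    except Exception as exc:
        print("import failed: {}".format(exc), file=sys.stderr)
        sys.exit(1)  # non-zero exit status makes the bash command fail

if __name__ == '__main__':
    main()
With a non-zero exit code from the script, the BashOperator raises an error and the task is retried according to its retries setting.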
