Delete a DAG in Google Composer - Airflow UI

I want to delete a DAG from the Airflow UI that is no longer available in the GCS /dags folder.
I know that Airflow has a "new" way to remove DAGs from the database using the
airflow delete_dag my_dag_id command, as seen in https://stackoverflow.com/a/49683543/5318634
It seems that the Airflow version bundled with Composer does not support the delete_dag command yet.
Do not try this: I also tried airflow resetdb, and the Airflow UI died.
Is there a way to delete the DAGs that are no longer present in the gs://BUCKET/dags/ folder?

I created a DAG to clean up the UI. It reads the DAG id to delete from the Airflow Variable afDagID:
from airflow import DAG
from airflow import models
from airflow.operators.mysql_operator import MySqlOperator
import logging
from datetime import datetime
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    'ManageAirFlow',
    description='Deletes Airflow DAGs from backend: Uses vars- afDagID',
    schedule_interval=None,
    start_date=datetime(2018, 3, 20),
    catchup=False,
)

DeleteXComOperator = MySqlOperator(
    task_id='delete-xcom-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from xcom where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteTaskOperator = MySqlOperator(
    task_id='delete-task-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from task_instance where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteSLAMissOperator = MySqlOperator(
    task_id='delete-sla-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from sla_miss where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteLogOperator = MySqlOperator(
    task_id='delete-log-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from log where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteJobOperator = MySqlOperator(
    task_id='delete-job-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from job where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteDagRunOperator = MySqlOperator(
    task_id='delete-dag_run-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from dag_run where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteDagOperator = MySqlOperator(
    task_id='delete-dag-record-task',
    mysql_conn_id='airflow_db',
    sql="DELETE from dag where dag_id='{}'".format(models.Variable.get('afDagID')),
    dag=dag)

DeleteXComOperator >> DeleteTaskOperator >> DeleteSLAMissOperator >> DeleteLogOperator >> DeleteJobOperator >> DeleteDagRunOperator >> DeleteDagOperator
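To use it, set the afDagID Variable to the id of the DAG you want to purge, then trigger ManageAirFlow manually from the UI or with airflow trigger_dag ManageAirFlow. A minimal sketch of setting the Variable from Python (the my_old_dag id below is just a placeholder):

from airflow import models

# Point the cleanup DAG at the DAG id whose metadata rows should be deleted.
models.Variable.set('afDagID', 'my_old_dag')

Note that if the DAG's .py file is still present in gs://BUCKET/dags/, the scheduler will re-register it, so remove the file from the bucket first.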

As Cloud Composer uses Airflow 1.9.0 (the latest stable version at the time), the delete_dag feature is not available.
However,
the docs do give instructions for deleting a DAG file:
gcloud beta composer environments storage dags delete \
--environment ENVIRONMENT_NAME \
--location LOCATION \
DAG_NAME.py
Unfortunately, this does not remove the DAG from the Airflow web interface.
More info: https://cloud.google.com/composer/docs/how-to/using/managing-dags#deleting_a_dag

This is done in two steps:
Step 1:
Delete the DAG file from the environment's storage bucket, e.g. for a file named airflow_monitoring.py:
gcloud composer environments storage dags delete --environment viu-etl-prod-composer --location us-central1 airflow_monitoring.py
Step 2:
In the Airflow web UI, click the red delete (X) button next to the DAG (screenshot omitted).

Related

Airflow CLI: How to get status of dag tasks in Airflow 1.10.12?

In Airflow 2.0, you can get the status of the tasks in a DAG run with the CLI command airflow tasks states-for-dag-run (see the docs: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#state_repeat1).
What's the equivalent in Airflow 1.10.12? I can't seem to find it in the 1.10.12 docs.
There is no direct equivalent as this is a new CLI command of Airflow 2.0.
In Airflow 1.10.12 you can do (docs):
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
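For example (the dag_id, task_id, and execution date below are placeholders):
airflow task_state my_dag my_task 2021-01-01T00:00:00+00:00
This prints the state of that task instance (e.g. success, running, failed).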

Airflow find DAG Runs with specific conf

I'm using apache-airflow 1.10.10.
My use case is: I have 4 DAGs, and each of them triggers a common DAG named "dag_common" with different conf parameters using a BashOperator after some work:
airflow trigger_dag -c '{"id":"1"}' dag_common
airflow trigger_dag -c '{"id":"2"}' dag_common
airflow trigger_dag -c '{"id":"3"}' dag_common
airflow trigger_dag -c '{"id":"4"}' dag_common
Inside these DAGs I have to wait for the triggered DAG to finish. How can I accomplish this?
Dag1 has to wait until dag_common finishes with conf id=1.
Is there any way to find all DAG runs with a specific conf?
It looks like a use case for SubDAGs: implement dag_common as a subDAG and use SubDagOperator() in those four DAGs to run it, as sketched below.
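A minimal sketch of that approach for Airflow 1.10.x; the dag1 parent DAG and the build_dag_common factory are hypothetical stand-ins for your own DAGs:

from datetime import datetime

from airflow import DAG
from airflow.operators.subdag_operator import SubDagOperator

default_args = {'start_date': datetime(2020, 1, 1)}

def build_dag_common(parent_dag_id, task_id, conf_id):
    # A subDAG's dag_id must be "<parent_dag_id>.<task_id>".
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_id, task_id),
        default_args=default_args,
        schedule_interval=None,
    )
    # ... add the tasks of dag_common here, parameterised by conf_id ...
    return subdag

dag1 = DAG('dag1', default_args=default_args, schedule_interval=None)

# The subDAG's tasks run as part of dag1, so dag1 naturally waits for
# "its" copy of the common work instead of an externally triggered run.
run_common = SubDagOperator(
    task_id='run_dag_common',
    subdag=build_dag_common('dag1', 'run_dag_common', conf_id='1'),
    dag=dag1,
)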

airflow user impersonation (run_as_user) not working

I am trying to use the run_as_user feature in Airflow for our DAG and we are facing some issues. Any help or recommendations?
DAG code:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

current_time = datetime.now() - timedelta(days=1)

default_args = {
    'start_date': datetime.strptime(current_time.strftime('%Y-%m-%d %H:%M:%S'), '%Y-%m-%d %H:%M:%S'),
    'run_as_user': 'airflowaduser',
    'execution_timeout': timedelta(minutes=5)
}

dag = DAG(
    'test_run-as_user',
    default_args=default_args,
    description='Run hive Query DAG',
    schedule_interval='0 * * * *',
)

hive_ex = BashOperator(
    task_id='hive-ex',
    bash_command='whoami',
    dag=dag,
)
I have the airflow user added to sudoers, and it can switch to airflowaduser without a password from a Linux shell:
airflow ALL=(ALL) NOPASSWD: ALL
Error details below while running the DAG:
*** Reading local file: /home/airflow/logs/test_run-as_user/hive-ex/2020-06-09T16:00:00+00:00/1.log
[2020-06-09 17:00:04,602] {taskinstance.py:620} INFO - Dependencies all met for <TaskInstance: test_run-as_user.hive-ex 2020-06-09T16:00:00+00:00 [queued]>
[2020-06-09 17:00:04,613] {taskinstance.py:620} INFO - Dependencies all met for <TaskInstance: test_run-as_user.hive-ex 2020-06-09T16:00:00+00:00 [queued]>
[2020-06-09 17:00:04,613] {taskinstance.py:838} INFO -
--------------------------------------------------------------------------------
[2020-06-09 17:00:04,613] {taskinstance.py:839} INFO - Starting attempt 1 of 1
[2020-06-09 17:00:04,613] {taskinstance.py:840} INFO -
--------------------------------------------------------------------------------
[2020-06-09 17:00:04,651] {taskinstance.py:859} INFO - Executing <Task(BashOperator): hive-ex> on 2020-06-09T16:00:00+00:00
[2020-06-09 17:00:04,651] {base_task_runner.py:133} INFO - Running: ['sudo', '-E', '-H', '-u', 'airflowaduser', 'airflow', 'run', 'test_run-as_user', 'hive-ex', '2020-06-09T16:00:00+00:00', '--job_id', '2314', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/test_run-as_user/testscript.py', '--cfg_path', '/tmp/tmpbinlgw54']
[2020-06-09 17:00:04,664] {base_task_runner.py:115} INFO - Job 2314: Subtask hive-ex sudo: airflow: command not found
[2020-06-09 17:00:09,576] {logging_mixin.py:95} INFO - [2020-06-09 17:00:09,575] {local_task_job.py:105} INFO - Task exited with return code 1
And our Airflow runs in a virtual environment.
When running Airflow in a virtual environment, only the airflow user is configured to run airflow commands. If you want to run as another user, you need to set that user's home directory to be the same as the airflow user's (/home/airflow) and make it belong to the 0 group. See https://airflow.apache.org/docs/docker-stack/entrypoint.html#allowing-arbitrary-user-to-run-the-container
Additionally, the run_as_user feature calls sudo, which only searches the secure_path for commands. The location of the airflow executable is not part of the secure_path, but it can be added in the sudoers file. You can use whereis airflow to check where the airflow directory is; in my container it was /home/airflow/.local/bin.
To solve this I needed to add 4 lines to my Dockerfile:

RUN useradd -u [airflowaduser UID] -g 0 -d /home/airflow airflowaduser && \
    # create airflowaduser
    usermod -u [airflow UID] -aG sudo airflow && \
    # add airflow to the sudo group
    echo "airflow ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
    # allow airflow to run sudo without a password
    sed -i 's#/.venv/bin#/home/airflow/.local/bin:/.venv/bin#' /etc/sudoers
    # update secure_path to include the airflow directory
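A quick sanity check, mirroring the command the task runner uses in the log above, is to run as the airflow user:
sudo -E -H -u airflowaduser airflow version
If sudo can now locate the airflow executable, run_as_user tasks should no longer fail with "airflow: command not found".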

Airflow trigger_dag not working from command line

I have the following airflow setup (version v1.10.0).
default_args = {
    'owner': 'Google Cloud Datalab',
    'email': [],
    'start_date': datetime.datetime.strptime('2019-07-04T00:00:00', '%Y-%m-%dT%H:%M:%S'),
    'end_date': None,
    'depends_on_past': False,
}
dag = DAG(dag_id='my_dag', schedule_interval='00 23 * * *', catchup=False, default_args=default_args)
The DAG has been turned ON in the Airflow UI, and the airflow scheduler is running too (ps -ef | grep "airflow scheduler" gives multiple hits).
I am trying to run this dag for a specific date in the past (2 days ago) using command line as
nohup airflow trigger_dag -e 2019-08-27 my_dag &
The command just exits after a few seconds. nohup.out shows the following at the end:
[2019-08-29 22:16:06,144] {cli.py:237} INFO - Created <DagRun my_dag @ 2019-08-27 00:00:00+00:00: manual__2019-08-27T00:00:00+00:00, externally triggered: True>
This is a fresh airflow instance and there have been no previous runs of the dag. How can I get it to run for the specific date?

For Apache Airflow, How can I pass the parameters when manually trigger DAG via CLI?

I use Airflow to manage ETL task execution and scheduling. A DAG has been created and it works fine. But is it possible to pass parameters when manually triggering the DAG via the CLI?
For example:
My DAG runs every day at 01:30 and processes data for yesterday (time range from 01:30 yesterday to 01:30 today). There might be some issues with the data source, and then I need to re-process that data (manually specifying the time range).
So can I create an Airflow DAG where the scheduled runs use the default time range (01:30 yesterday to 01:30 today), and, if anything goes wrong with the data source, I manually trigger the DAG and pass the time range as parameters?
As far as I know, airflow test has -tp, which can pass params to the task, but that is only for testing a specific task, and airflow trigger_dag doesn't have a -tp option. So is there any way to trigger_dag and pass parameters to the DAG so that the operators can read them?
Thanks!
You can pass parameters from the CLI using --conf '{"key":"value"}' and then use them in the DAG file as "{{ dag_run.conf["key"] }}" in a templated field.
CLI:
airflow trigger_dag 'example_dag_conf' -r 'run_id' --conf '{"message":"value"}'
DAG File:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
)

def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".format(
        kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag,
)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}" ',
    dag=dag,
)
This should work, as per the airflow documentation: https://airflow.apache.org/cli.html#trigger_dag
airflow trigger_dag -c '{"key1":1, "key2":2}' dag_id
Make sure the value of -c is a valid JSON string; the double quotes wrapping the keys are necessary here.
Suppose you pass a list of strings under a key, e.g. --conf '{"key": ["param1=somevalue1", "param2=somevalue2"]}'. There are two ways to read it.
First way, in a templated field:
"{{ dag_run.conf['key'] }}"
This renders the passed value as the string "['param1=somevalue1', 'param2=somevalue2']".
Second way, in a python_callable:
def get_parameters(**kwargs):
    dag_run = kwargs.get('dag_run')
    parameters = dag_run.conf['key']
    return parameters

In this scenario, the value comes through as an actual list: ['param1=somevalue1', 'param2=somevalue2']
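A minimal sketch of wiring the second way into a DAG, assuming the dag object and get_parameters function above (the read_conf task id is just an illustration):

read_conf = PythonOperator(
    task_id='read_conf',
    provide_context=True,  # needed on Airflow 1.x so kwargs contains dag_run
    python_callable=get_parameters,
    dag=dag,
)

Triggered with, for example:
airflow trigger_dag -c '{"key": ["param1=somevalue1", "param2=somevalue2"]}' example_dag_conf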
