Create folder in webserver and scheduler via Airflow - airflow

When running the following operator a folder called my_folder is created in the Airflow worker. This makes sense since it is the worker executing the task.
run_this = BashOperator(
    task_id='run_this',
    bash_command='mkdir my_folder'
)
However, I'd like to create this folder both in the webserver and scheduler. Is there an easy way to do this?
My idea is to update other dags from a dag by copying them from s3, but being able to do this is a first step.

One option is to mount a shared volume that all components (worker, scheduler, and webserver) can access, and update your DAG to create the folder in that shared volume, like so:
run_this = BashOperator(
    task_id='run_this',
    bash_command='mkdir /shared_volume/my_folder'
)
How you do this is totally dependent on how you're deploying airflow.
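As a rough sketch of the same idea in Python rather than bash: plain mkdir fails when the folder already exists, so a rerun of the task would error out. The helper below (ensure_shared_folder is a hypothetical name, not an Airflow API) behaves like mkdir -p. The demo uses a temporary directory as a stand-in for /shared_volume, which is assumed to be mounted on every component.

```python
import tempfile
from pathlib import Path


def ensure_shared_folder(shared_root: str, name: str) -> Path:
    """Create <shared_root>/<name> if it is missing and return its path."""
    folder = Path(shared_root) / name
    # parents=True and exist_ok=True mirror `mkdir -p`, so task reruns do not fail.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Demo against a temporary directory standing in for /shared_volume.
with tempfile.TemporaryDirectory() as fake_shared_volume:
    first = ensure_shared_folder(fake_shared_volume, "my_folder")
    second = ensure_shared_folder(fake_shared_volume, "my_folder")  # rerun is a no-op
    created = first.is_dir() and first == second
```

A function like this could be passed as the python_callable of a PythonOperator instead of using a BashOperator.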

Related

How to avoid DAG generation during task run

We have an Airflow Python script that reads configuration files and then generates more than 100 DAGs dynamically. When running the script in Airflow 2.4.1, we noticed from the task run log that Airflow parses our Python script for every task run.
https://github.com/apache/airflow/blob/2.4.1/airflow/task/task_runner/standard_task_runner.py#L91-L97
Is there any way to make Airflow deserialize DAGs from DBs instead?
Just found out that this is expected behavior:
https://medium.com/apache-airflow/airflows-magic-loop-ec424b05b629
https://medium.com/apache-airflow/magic-loop-in-airflow-reloaded-3e1bd8fb6671
However, the Python script can use the parsing context to load only the respective DAG:
https://github.com/apache/airflow/pull/25161
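A minimal sketch of the filtering logic, kept free of Airflow imports so it runs on its own. The helper dag_ids_to_build and the generated_dag_N ids are hypothetical. In a real DAG file on Airflow 2.4+, current_dag_id would come from the parsing context, as noted in the comment.

```python
from typing import Iterable, List, Optional


def dag_ids_to_build(all_dag_ids: Iterable[str], current_dag_id: Optional[str]) -> List[str]:
    """Return the DAG ids the script should generate during this parse."""
    if current_dag_id is None:
        # Scheduler-side parse: no specific DAG was requested, so build everything.
        return list(all_dag_ids)
    # Task run: build only the DAG that the running task belongs to.
    return [dag_id for dag_id in all_dag_ids if dag_id == current_dag_id]


# In the real DAG file (Airflow 2.4+), current_dag_id would come from:
#   from airflow.utils.dag_parsing_context import get_parsing_context
#   current_dag_id = get_parsing_context().dag_id
configured = [f"generated_dag_{i}" for i in range(100)]  # hypothetical ids

full_parse = dag_ids_to_build(configured, None)
task_parse = dag_ids_to_build(configured, "generated_dag_7")
```

With this in place, a task run only pays the cost of constructing its own DAG, while the scheduler still sees all of them.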

How do you create a triggerer process in an Airflow installation?

In an Airflow DAG, I am trying to use a TimeDeltaTrigger:
from airflow.triggers.temporal import TimeDeltaTrigger
...
self.defer(trigger=TimeDeltaTrigger(timedelta(seconds=15)), method_name="execute")
But when my DAG runs, I get a warning in the GUI:
In the GUI, if I go to Browse -> Triggers I see one trigger, but it is not for TimeDeltaTrigger:
The documentation for Deferrable Operators (https://airflow.apache.org/docs/apache-airflow/stable/concepts/deferring.html) says:
Ensure your Airflow installation is running at least one triggerer process, as well as the normal scheduler
But it is not clear how to do this.
How can I configure my Airflow installation so that I can use a TimeDeltaTrigger?
The triggerer is a process like the scheduler, webserver, and worker. You need to start a process or container dedicated to running the triggerer in order to use deferrable operators.
To start a triggerer process, run airflow triggerer in your Airflow environment. The process writes its logs to the console once it is running.

How to manage python packages between airflow dags?

If I have multiple airflow dags with some overlapping python package dependencies, how can I keep each of these project deps. decoupled? Eg. if I had project A and B on same server I would run each of them with something like...
source /path/to/virtualenv_a/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/activate
python script_b.py
deactivate
Basically, I would like to run DAGs in the same way: each DAG uses Python scripts that may have overlapping package dependencies, which I would like to develop separately (i.e. not have to update all code using a package when I want to update the package for just one project). Note that I've been using the BashOperator to run Python tasks like...
do_stuff = BashOperator(
    task_id='my_task',
    bash_command='python /path/to/script.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
Is there a way to get this working? Is there some other best-practice way that Airflow intends for people to address (or avoid) these kinds of problems?
Based on discussion from the apache-airflow mailing list, the simplest answer that addresses the modular way in which I am using various python scripts for tasks is to directly call virtualenv python interpreter binaries for each script or module, eg.
source /path/to/virtualenv_a/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/activate
python script_b.py
deactivate
would translate to something like
do_stuff_a = BashOperator(
    task_id='my_task_a',
    bash_command='/path/to/virtualenv_a/bin/python /path/to/script_a.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
do_stuff_b = BashOperator(
    task_id='my_task_b',
    bash_command='/path/to/virtualenv_b/bin/python /path/to/script_b.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
in an airflow dag.
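If many tasks follow this pattern, the command strings can be built by a small helper. This is only a sketch: venv_python_command is a hypothetical name, and it assumes the usual virtualenv layout with the interpreter at bin/python.

```python
import shlex
from typing import Sequence


def venv_python_command(venv_dir: str, script: str, args: Sequence[str] = ()) -> str:
    """Build a shell command that runs `script` with the interpreter of `venv_dir`."""
    interpreter = f"{venv_dir.rstrip('/')}/bin/python"
    parts = [interpreter, script, *args]
    # shlex.quote guards against spaces or shell metacharacters in paths and arguments.
    return " ".join(shlex.quote(part) for part in parts)


command_a = venv_python_command("/path/to/virtualenv_a", "/path/to/script_a.py")
command_b = venv_python_command("/path/to/virtualenv_b", "/path/to/script_b.py")
```

Each resulting string can be passed as the bash_command of its BashOperator, so every task runs against its own set of installed packages.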
To the question of passing args to the Tasks: it depends on the nature of the args you want to pass in. In my case, certain args depend on what a data table looks like on the day the DAG is run (e.g. the highest timestamp record in the table). To add these args to the Tasks, I have a "config" DAG that runs before this one. In the config DAG, one Task generates the args for the "real" DAG as a Python dict and writes it to a pickle file. A second Task, a TriggerDagRunOperator, then activates the "real" DAG, which has initial logic to read the pickle file generated by the config DAG (in my case, as a dict), and I read it into the bash_command string like bash_command=f"python script.py {configs['arg1']}".
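The pickle hand-off described above might look roughly like the following. The function names, the file name configs.pkl, and the sample value are assumptions for illustration. In a real deployment the file has to live on a path that the workers of both DAGs can read, and pickle files should only be loaded from trusted sources.

```python
import pickle
import shlex
import tempfile
from pathlib import Path


def write_configs(path: Path, configs: dict) -> None:
    """Run by the "config" DAG: persist the computed arguments."""
    with path.open("wb") as handle:
        pickle.dump(configs, handle)


def build_bash_command(path: Path) -> str:
    """Run when the "real" DAG is defined: read the arguments back into a command."""
    with path.open("rb") as handle:
        configs = pickle.load(handle)
    # Quote the value so unusual characters cannot break the shell command.
    return f"python script.py {shlex.quote(str(configs['arg1']))}"


# Round trip in a temporary directory standing in for the shared location.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "configs.pkl"
    write_configs(config_file, {"arg1": "2017-08-06"})  # hypothetical value
    command = build_bash_command(config_file)
```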
You can use packaged DAGs, where each DAG is packaged together with its dependencies:
http://airflow.apache.org/concepts.html#packaged-dags
There are operators for running Python. There is a relatively new one, the PythonVirtualenvOperator which will create an ephemeral virtualenv, install your dependencies, run your python, then tear down the environment. This does create some per-task overhead but is a functional (if not ideal) approach to your dependency overlap issue.
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator

Airflow not loading dags in /usr/local/airflow/dags

Airflow seems to be skipping the dags I added to /usr/local/airflow/dags.
When I run
airflow list_dags
The output shows
[2017-08-06 17:03:47,220] {models.py:168} INFO - Filling up the DagBag from /usr/local/airflow/dags
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
example_bash_operator
example_branch_dop_operator_v3
example_branch_operator
example_http_operator
example_passing_params_via_test_command
example_python_operator
example_short_circuit_operator
example_skip_dag
example_subdag_operator
example_subdag_operator.section-1
example_subdag_operator.section-2
example_trigger_controller_dag
example_trigger_target_dag
example_xcom
latest_only
latest_only_with_trigger
test_utils
tutorial
But this doesn't include the dags in /usr/local/airflow/dags
ls -la /usr/local/airflow/dags/
total 20
drwxr-xr-x 3 airflow airflow 4096 Aug 6 17:08 .
drwxr-xr-x 4 airflow airflow 4096 Aug 6 16:57 ..
-rw-r--r-- 1 airflow airflow 1645 Aug 6 17:03 custom_example_bash_operator.py
drwxr-xr-x 2 airflow airflow 4096 Aug 6 17:08 __pycache__
Is there some other condition that needs to be satisfied for Airflow to identify a DAG and load it?
My dag was being loaded, but I had the name of the DAG wrong. I was expecting the dag to be named after the file, but the name is determined by the first argument to the DAG constructor:
dag = DAG(
'tutorial', default_args=default_args, schedule_interval=timedelta(1))
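To check which names a file actually declares without loading Airflow, one rough approach is to inspect the source with the standard ast module. This sketch (declared_dag_ids is a hypothetical helper) only finds ids written as string literals, either as the first positional argument or as the dag_id keyword of a DAG(...) call.

```python
import ast
from typing import List


def declared_dag_ids(source: str) -> List[str]:
    """Return literal dag ids passed to DAG(...) calls in the given source code."""
    dag_ids = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both `DAG(...)` and attribute access such as `models.DAG(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "DAG":
            continue
        # dag_id is either the first positional argument or the dag_id keyword.
        candidates = node.args[:1] + [kw.value for kw in node.keywords if kw.arg == "dag_id"]
        for value in candidates:
            if isinstance(value, ast.Constant) and isinstance(value.value, str):
                dag_ids.append(value.value)
    return dag_ids


example_source = (
    "dag = DAG(\n"
    "    'tutorial', default_args=default_args, schedule_interval=timedelta(1))\n"
)
found = declared_dag_ids(example_source)
```

For the snippet above this reports tutorial, regardless of what the file itself is called.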
Try airflow db init (airflow initdb in Airflow 1.x) before listing the dags. This is because airflow list_dags lists all the dags present in the database (and not in the folder you mentioned). Initializing the database creates entries for these dags in the database.
Make sure the environment variable AIRFLOW_HOME is set to /usr/local/airflow. If this variable is not set, Airflow looks for dags in the default home folder (~/airflow), which might not exist in your case.
The example files are not in /usr/local/airflow/dags. You can hide them by editing airflow.cfg (usually in ~/airflow): set load_examples = False in the [core] section.
There are a couple of errors that may keep your DAG from being listed by list_dags:
1. Your DAG file has a syntax issue. To check this, run python custom_example_bash_operator.py and see whether it reports any problem.
2. Check whether the folder is the default DAG loading path. For a newcomer, I suggest creating a new .py file, copying the sample from https://airflow.incubator.apache.org/tutorial.html, and then checking whether the test DAG shows up.
3. Make sure there is dag = DAG('dag_name', default_args=default_args) in the DAG file.
dag = DAG(
dag_id='example_bash_operator',
default_args=args,
schedule_interval='0 0 * * *',
dagrun_timeout=timedelta(minutes=60))
When a DAG is instantiated it pops up by the name you specify in the dag_id attribute. dag_id serves as a unique identifier for your DAG
This happens if airflow.cfg points to an incorrect path.
STEP 1: Go to {basepath}/src/config/
STEP 2: Open the airflow.cfg file
STEP 3: Check that the path points to the dags folder you have created
dags_folder = /usr/local/airflow/dags
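The check in STEP 3 can also be scripted. The sketch below reads dags_folder from the text of an airflow.cfg using the standard configparser module. The function name and the sample config are illustrative, and it assumes the setting lives in the [core] section as in the answers above. Note that an AIRFLOW__CORE__DAGS_FOLDER environment variable, if set, takes precedence over the file.

```python
import configparser


def dags_folder_from_cfg(cfg_text: str) -> str:
    """Return the dags_folder setting from the contents of an airflow.cfg file."""
    # interpolation=None: airflow.cfg contains literal % characters (e.g. in log formats).
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("core", "dags_folder")


sample_cfg = """
[core]
dags_folder = /usr/local/airflow/dags
load_examples = False
"""
configured_folder = dags_folder_from_cfg(sample_cfg)
```

For a real installation, pass in the file contents, for example Path("~/airflow/airflow.cfg").expanduser().read_text() with pathlib, adjusting the path to wherever your airflow.cfg lives.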
I find that I have to restart the scheduler for the UI to pick up new dags. When I make changes to a dag in my dags folder, the updated dags appear in the list when I run airflow list_dags, but not in the UI until I restart the scheduler.
First try running:
airflow scheduler
There can be two issues:
1. Check the Dag name given at the time of DAG object creation in the DAG python program
dag = DAG(
dag_id='Name_Of_Your_DAG',
....)
Note that the name given is often the same as one already present in the list of DAGs (for example, if you copied the DAG code). If this is not the case, then
2. Check the path set to the DAG folder in Airflow's config file.
You can create the DAG file anywhere on your system, but you need to set the path to that DAG folder/directory in Airflow's config file.
For example, if I have created my DAG folder in the home directory, I have to edit the airflow.cfg file using the following commands in the terminal:
Creating a DAG folder in the home or root directory:
$mkdir ~/DAG
Editing airflow.cfg present in the airflow directory where I have installed the airflow
~/$cd airflow
~/airflow$nano airflow.cfg
In this file change dags_folder path to DAG folder that we have created.
If you are still facing the problem, reinstall Airflow following its installation guide.
Does your custom_example_bash_operator.py have a DAG name different from the others?
If yes, try restarting the scheduler or even running resetdb. I often mistook the filename for the DAG name as well, so it is better to name them the same.
Can you share what is in custom_example_bash_operator.py? Airflow scans for certain magic strings inside a file to determine whether it is a DAG or not. It scans for airflow and for DAG.
In addition if you are using a duplicate dag_id for a DAG it will be overwritten. As you seem to be deriving from the example bash operator did you keep the name of the DAG example_bash_operator maybe? Try renaming that.
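The string scan mentioned in this answer can be approximated as follows. This is a simplified sketch, not Airflow's actual implementation: the function name is hypothetical, and whether the match is case-sensitive differs between Airflow versions (this version matches case-insensitively).

```python
def might_contain_dag(content: bytes) -> bool:
    """Cheap pre-filter: does the file mention both "airflow" and "dag"?"""
    lowered = content.lower()
    return b"airflow" in lowered and b"dag" in lowered


dag_file = b"from airflow import DAG\ndag = DAG('tutorial')\n"
plain_script = b"print('hello world')\n"

dag_file_passes = might_contain_dag(dag_file)
plain_script_passes = might_contain_dag(plain_script)
```

A file that fails a check like this is skipped before it is ever imported, which is why a DAG assembled entirely through helper modules may never be discovered.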
You need to set AIRFLOW_HOME first and initialise the db
export AIRFLOW_HOME=/myfolder
mkdir /myfolder/dags
airflow db init
You need to create a user too
airflow users create \
--username admin \
--firstname FIRST_NAME \
--lastname LAST_NAME \
--role Admin \
--email admin@example.org
If you have done it correctly you should see airflow.cfg in your folder. There you will find dags_folder which shows the dags folder.
If you have saved your dag inside this folder, you should see it in the dag list with
airflow dags list
or in the UI after starting
airflow webserver --port 8080
Otherwise, run airflow db init again.
In my case, a print(something) in the dag file prevented the dag list from being printed on the command line.
Check whether there is a print line in your dag if the above solutions are not working.
Try restarting the scheduler. The scheduler needs to be restarted when new DAGs need to be added to the DagBag.

How to run one airflow task and all its dependencies?

I suspected that
airflow run dag_id task_id execution_date
would run all upstream tasks, but it does not. It will simply fail when it sees that not all dependent tasks are run. How can I run a specific task and all its dependencies? I am guessing this is not possible because of an airflow design decision, but is there a way to get around this?
You can run a task independently by using the -i/-I/-A flags with the run command.
But yes, the design of Airflow does not permit running a specific task and all its dependencies.
For testing purposes, you can backfill the DAG after removing unrelated tasks from it.
A bit of a workaround, but if you have named your task_ids consistently, you can try backfilling from the Airflow CLI (Command Line Interface):
airflow backfill -t TASK_REGEX ... dag_id
where TASK_REGEX corresponds to the naming pattern of the task you want to rerun and its dependencies.
(remember to add the rest of the command line options, like --start_date).
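To make the selection concrete, here is a simplified model of what "the tasks matching TASK_REGEX plus their dependencies" means. It is not Airflow's code: the task names and the select_tasks helper are invented for illustration, and it assumes the regex is applied as a search against each task_id.

```python
import re
from typing import Dict, List, Set


def select_tasks(upstream: Dict[str, List[str]], task_regex: str) -> Set[str]:
    """Return tasks whose id matches task_regex plus all of their upstream tasks."""
    pattern = re.compile(task_regex)
    selected: Set[str] = set()
    stack = [task_id for task_id in upstream if pattern.search(task_id)]
    while stack:
        task_id = stack.pop()
        if task_id in selected:
            continue
        selected.add(task_id)
        # Walk towards the roots of the DAG so every dependency is included.
        stack.extend(upstream.get(task_id, []))
    return selected


# Hypothetical DAG: each task maps to the tasks it directly depends on.
upstream = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
    "cleanup": [],
}
to_run = select_tasks(upstream, "^load$")
```

Here the pattern selects load and pulls in transform and extract, while report and cleanup are left out.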
