I have my airflow project with the structure as below
airflow
|
|----dags
| |----dag.py
|
|----dbt-redshift
|----models
|----model.sql
I have included the dbt-redshift directory in the volumes section as
volumes:
  - ./dbt-redshift:/opt/airflow/dbt-redshift
And I'm trying to run dbt inside the DAG using a BashOperator:
dbt_task = BashOperator(task_id='dbt', bash_command="cd ~/dbt-redshift && dbt run", dag=dag)
But when I execute the DAG, I get the error
cd: /home/***/dbt-redshift no such file or directory
I'm not sure I understand how these directories are located inside the airflow project.
You are mounting the volume inside the container at /opt/airflow/dbt-redshift, but the BashOperator references ~/dbt-redshift, with ~ resolving to /home/airflow.
(Assuming you are using the apache/airflow image)
Either change the command used by the BashOperator to reference /opt/airflow/dbt-redshift or change the volume to mount to the home directory.
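For example, the first option looks like this (a sketch; the DAG id and schedule are assumptions, and it presumes dbt is installed in the container):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('dbt_redshift', start_date=datetime(2023, 1, 1),
         schedule_interval=None) as dag:
    dbt_task = BashOperator(
        task_id='dbt',
        # Use the container-side mount point from the volumes section,
        # not ~ (which resolves to /home/airflow in the official image).
        bash_command='cd /opt/airflow/dbt-redshift && dbt run',
    )
```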
I've installed Airflow on AWS EKS, and things are working great.
Now I'm trying to add a Slack alert to my DAGs. My DAG directory is like this:
So I tried to use alert.py by inserting this into DAGs
from utils.alert import SlackAlert
and the web UI shows an error:
Broken DAG: [/opt/airflow/dags/repo/dags/sample_slack.py] Traceback (most recent call last):
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/airflow/dags/repo/dags/sample_slack.py", line 8, in <module>
from utils.alert import SlackAlert
ModuleNotFoundError: No module named 'utils'
How can I make my DAGs be able to import packages from utils folder?
Plus+:
I deployed Airflow on Docker Desktop K8s locally, and it works.
Plus++:
I'm using gitSyncSidecar with persistence enabled. In the scheduler pod, I checked the dags path. Now I see that there's an auto-generated directory (maybe created by gitSyncSidecar?).
$ kubectl exec --stdin --tty airflow-scheduler-fc9c56d9c-ltql7 -- /bin/bash
Defaulted container "scheduler" out of: scheduler, git-sync, scheduler-log-groomer, wait-for-airflow-migrations (init), git-sync-init (init)
airflow@airflow-scheduler-fc9c56d9c-ltql7:/opt/airflow$ cd dags/
.git/ c5d7d684141f605142885d429e10ec3d81ca745b/ repo/
airflow@airflow-scheduler-fc9c56d9c-ltql7:/opt/airflow$ cd dags/c5d7d684141f605142885d429e10ec3d81ca745b/dags/utils/
airflow@airflow-scheduler-fc9c56d9c-ltql7:/opt/airflow/dags/c5d7d684141f605142885d429e10ec3d81ca745b/dags/utils$ ls
__init__.py alert.py
So in this environment, if I want to do what I'm trying to do, do I have to deploy Airflow, check the auto-generated directory's name, and use it in my DAG like this?
from c5d7d684141f605142885d429e10ec3d81ca745b.dags.utils.alert import SlackAlert
I solved my question with the help of Oluwafemi Sule's question.
I went into the pods and checked airflow.cfg; the default dags directory was set to '/opt/airflow/dags/repo', which is where my git repo starts. So I changed 'import utils' to 'import dags.utils' and now it finds the module correctly.
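A quick way to see why the dags folder setting matters, sketched with only the standard library (a temp directory stands in for /opt/airflow/dags/repo):

```python
import importlib
import os
import sys
import tempfile

# Recreate the gitSync layout <repo>/dags/utils/alert.py, then show that
# imports resolve relative to whatever directory is on sys.path -- which
# is what Airflow does with its configured dags folder.
repo = tempfile.mkdtemp()
pkg = os.path.join(repo, 'dags', 'utils')
os.makedirs(pkg)
for d in (os.path.join(repo, 'dags'), pkg):
    open(os.path.join(d, '__init__.py'), 'w').close()
with open(os.path.join(pkg, 'alert.py'), 'w') as f:
    f.write('class SlackAlert:\n    channel = "#alerts"\n')

sys.path.insert(0, repo)  # Airflow puts the dags folder on sys.path
alert = importlib.import_module('dags.utils.alert')
print(alert.SlackAlert.channel)  # -> #alerts
```

Because the root on sys.path is the repo directory, the import must start with `dags.`, not `utils.`.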
We're running an Airflow cluster using the puckel/airflow docker image with docker-compose. Airflow's scheduler container outputs its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated, so disk usage increases until the disk gets full. A DAG for cleaning up the log directory is available, but the DAG runs on a worker node, and the log directory on the scheduler container is not cleaned up.
I'm looking for a way to output the scheduler log to stdout or an S3/GCS bucket but have been unable to find one. Is there any way to output the scheduler log to stdout or an S3/GCS bucket?
Finally I managed to output scheduler's log to stdout.
Here you can find how to use a custom logger in Airflow. The default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config at ${AIRFLOW_HOME}/config/log_config.py:
# Send processor (scheduler, etc.) log output to stdout
# Referring https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
import sys
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set the logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg:
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable:
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set the path of logging_config_class to anything, as long as Python is able to load the package.
Setting the processor handler to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following environment variables, or set the corresponding options in airflow.cfg:
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The connection gcp_conn_id should have the correct permissions to create/delete objects in GCS.
I have a bare-bones Airflow installation on conda. I managed to create custom operators by putting them in the path:
airflow/dags/operators/custom_operator.py
then importing them in a DAG as:
from operators.custom_operator import CustomOperator
How can I instead achieve the folder structure:
airflow/operators/custom_operator.py
which would be imported in a DAG as:
from airflow.operators.custom_operator import CustomOperator
In case you think that's a bad approach, please point it out in your answer/comment. I'm happy to tweak my approach if there are better design patterns...
Interestingly, the solution here is in airflow.cfg (your Airflow config file): move the dags_folder parameter one directory up, to $AIRFLOW_HOME. So instead of having:
....
[core]
dags_folder = /home/user/airflow/dags
....
Just make it:
....
[core]
dags_folder = /home/user/airflow
....
Airflow will look recursively for DAGs under dags_folder and pick up only the DAG objects it finds, so you can keep a clean folder structure with custom operators, utility functions, custom sensors, etc. outside the dags/ folder.
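For reference, a minimal custom operator for this layout might look like the following (a sketch; the class name and message parameter are made up):

```python
# airflow/operators/custom_operator.py
from airflow.models import BaseOperator


class CustomOperator(BaseOperator):
    """Toy operator that logs and returns a message."""

    def __init__(self, message='hello', **kwargs):
        super().__init__(**kwargs)
        self.message = message

    def execute(self, context):
        # Whatever execute() returns is pushed to XCom by default.
        self.log.info('CustomOperator says: %s', self.message)
        return self.message
```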
I'm trying to run an Apache Beam pipeline (Python) on Google Cloud Dataflow, triggered by a DAG in Google Cloud Composer.
The structure of my dags folder in the respective GCS bucket is as follows:
/dags/
  dataflow.py          <- DAG
  dataflow/
    pipeline.py        <- pipeline
    setup.py
    my_modules/
      __init__.py
      commons.py       <- the module I want to import in the pipeline
The setup.py is very basic, following the Apache Beam docs and answers on SO:
import setuptools

setuptools.setup(packages=setuptools.find_packages())
In the DAG file (dataflow.py) I set the setup_file option and pass it to Dataflow:
default_dag_args = {
    ...,
    'dataflow_default_options': {
        ...,
        'runner': 'DataflowRunner',
        'setup_file': os.path.join(configuration.get('core', 'dags_folder'),
                                   'dataflow', 'setup.py')
    }
}
Within the pipeline file (pipeline.py) I try to use
from my_modules import commons
but this fails. The log in Google Cloud Composer (Apache Airflow) says:
gcp_dataflow_hook.py:132} WARNING - b' File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11\n from my_modules import commons\n ^\nSyntaxError: invalid syntax'
The basic idea behind the setup.py file is documented here
Also, there are similar questions on SO which helped me:
Google Dataflow - Failed to import custom python modules
Dataflow/apache beam: manage custom module dependencies
I'm actually wondering why my pipeline fails with a SyntaxError and not a module-not-found kind of error...
I tried to reproduce your issue and then solve it, so I created the same folder structure you already have:
/dags/
  dataflow.py
  dataflow/
    pipeline.py        <- pipeline
    setup.py
    my_modules/
      __init__.py
      commons.py
Therefore, to make it work, the change I made was to copy these folders to a place where the instance running the code is able to find them, for example the /tmp/ folder of the instance.
So, my DAG would be something like this:
1 - First of all, I declare my arguments:
default_args = {
    'start_date': datetime(xxxx, x, x),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'dataflow_default_options': {
        'project': '<project>',
        'region': '<region>',
        'stagingLocation': 'gs://<bucket>/stage',
        'tempLocation': 'gs://<bucket>/temp',
        'setup_file': <setup.py>,
        'runner': 'DataflowRunner'
    }
}
2 - After this, I created the DAG. Before running the Dataflow task, I copy the whole folder directory created above into the /tmp/ folder of the instance (task t1), and after this, I run the pipeline from the /tmp/ directory (task t2):
with DAG(
        'composer_df',
        default_args=default_args,
        description='dataflow dag',
        schedule_interval="xxxx") as dag:

    def copy_dependencies():
        process = subprocess.Popen(['gsutil', 'cp', '-r',
                                    'gs://<bucket>/dags/*', '/tmp/'])
        process.communicate()

    t1 = python_operator.PythonOperator(
        task_id='copy_dependencies',
        python_callable=copy_dependencies,
        provide_context=False
    )

    t2 = DataFlowPythonOperator(
        task_id="composer_dataflow",
        py_file='/tmp/dataflow/pipeline.py',
        job_name='job_composer'
    )

    t1 >> t2
That's how I created the DAG file dataflow.py; then, in pipeline.py, the package import would be:
from my_modules import commons
It should work fine, since the folder structure is visible to the VM.
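As a sanity check that setup.py will pick up the module, you can recreate the layout and see what find_packages discovers (a sketch; a temp directory stands in for the dataflow/ folder):

```python
import os
import tempfile

import setuptools

# Recreate the layout: <root>/my_modules/{__init__.py, commons.py}.
# my_modules has an __init__.py, so find_packages() reports it.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'my_modules'))
open(os.path.join(root, 'my_modules', '__init__.py'), 'w').close()
with open(os.path.join(root, 'my_modules', 'commons.py'), 'w') as f:
    f.write('GREETING = "hello"\n')

print(setuptools.find_packages(where=root))  # -> ['my_modules']
```

Note that find_packages must be passed to setup() via the `packages` keyword, not positionally.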
I am trying to package my repository with my DAG in a zip file, as described here in the documentation.
So I have followed the convention in the documentation, which is to keep the DAG in the root of the zip; the subdirectories are viewed as packages by Airflow.
My zip file has the following contents:
$ unzip -l $AIRFLOW_HOME/dags/test_with_zip.zip
Archive:  /home/arjunc/Tutorials/airflow/dags/test_with_zip.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2018-03-29 17:46   helloworld/
      189  2018-03-29 17:22   helloworld/hello.py
        0  2018-03-29 17:18   helloworld/__init__.py
      461  2018-03-29 17:24   test_with_zip_dag.py
---------                     -------
      650                     4 files
Where test_with_zip_dag.py is the file in the root directory with the Dag definitions as follows:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from helloworld.hello import HelloWorld


def run():
    return HelloWorld().run()


dag = DAG('test_with_zip', description='Test Dependencies With Zipping',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

hello_operator = PythonOperator(task_id='hello_task', python_callable=run, dag=dag)
I have placed this zip in the default dags directory $AIRFLOW_HOME/dags, but my dag isn't recognized!
What am I doing wrong?
Update
When I restarted the webserver, the DAG test_with_zip popped up, but it is not runnable because the scheduler doesn't seem to recognize it. I get the following error for it (from the web interface):
This DAG seems to be existing only locally. The master scheduler doesn't seem to be aware of its existence.
Which version of Airflow are you on? Airflow 1.8.1 had problems with loading DAGs from zips. This issue was fixed in 1.8.3: https://issues.apache.org/jira/browse/AIRFLOW-1357
I recommend that you update to the latest version of Airflow, i.e. 1.9.0.
You mention only to restart the webserver.
You also need to start the scheduler with airflow scheduler.
Also, see more steps to check here: Airflow 1.9.0 is queuing but not launching tasks
The DAG python file has to be in the root of the zip package. See https://airflow.apache.org/docs/stable/concepts.html#packaged-dags
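The layout from the listing above can be produced with a short standard-library script (a sketch; the helper function name is made up):

```python
import zipfile


def build_packaged_dag(zip_path, dag_name, dag_source, helpers):
    """Build a packaged DAG zip: the DAG module at the zip root,
    helper code as package subdirectories beside it.

    helpers maps archive paths (e.g. 'helloworld/hello.py') to contents.
    """
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr(dag_name, dag_source)  # DAG file must sit at the zip root
        for arcname, text in helpers.items():
            zf.writestr(arcname, text)
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

Dropping the resulting zip into $AIRFLOW_HOME/dags and restarting both the webserver and the scheduler should make the DAG appear.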