In Cloud Composer (Airflow 1.10.10), is it possible to create an airflow_local_settings.py file? And if so, where should it be stored? I need this because I need an initContainer for my pod.
Add an airflow_local_settings.py file to your $PYTHONPATH or to the $AIRFLOW_HOME/config folder.
For me, the above statement is unclear for Cloud Composer, as this config folder in the environment bucket would probably not be synced to the workers.
Based on discussions in the Apache Airflow Community Slack, it is not supported yet.
airflow_local_settings.py can be placed in the dags/ folder of the Composer GCS bucket, because Airflow's initialization adds the DAGs folder to sys.path before the configuration folder ($AIRFLOW_HOME/config). Keep in mind that by doing so you are overriding Composer's default policies.
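For the initContainer use case, a minimal sketch of such a file is shown below. It assumes Airflow 1.10.x, where pod_mutation_hook receives Airflow's own Pod wrapper object; the attribute names and the dict-based init-container structure are assumptions that may differ between versions, so treat this as a starting point rather than a verified recipe.

# airflow_local_settings.py -- dropped into the dags/ folder of the Composer bucket.
# Sketch only: assumes Airflow 1.10.x, where this hook receives Airflow's Pod
# wrapper; attribute names and the dict form of the init container are assumptions.

def pod_mutation_hook(pod):
    # Tag mutated pods so they are easy to spot.
    pod.labels = getattr(pod, "labels", None) or {}
    pod.labels["mutated-by"] = "airflow-local-settings"

    # Hypothetical init container; the expected structure of pod.init_containers
    # depends on the Airflow/Composer version in use.
    init_container = {
        "name": "init-example",
        "image": "busybox:1.31",
        "command": ["sh", "-c", "echo init done"],
    }
    pod.init_containers = (getattr(pod, "init_containers", None) or []) + [init_container]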
Where do you put your actual code? The DAG files must stay thin, which assumes that when a task starts to run it does the imports and runs some Python code.
When we were on standalone Airflow I could add my project root to PYTHONPATH and do the imports from there, but with the AWS managed Airflow (MWAA) I can't find any clues.
Put your DAGs into S3. When you create your MWAA environment, you specify the S3 bucket containing your code.
E.g., create a bucket <my-dag-bucket> and place your DAGs in a subfolder dags
s3://<my-dag-bucket>/dags/
Also make sure to define all Python dependencies in a requirements file and put that file in the same bucket as well:
s3://<my-dag-bucket>/requirements.txt
Finally, if you need to provide your own modules, zip them up and put the zip file in the bucket too:
s3://<my-dag-bucket>/plugins.zip
See https://docs.aws.amazon.com/mwaa/latest/userguide/get-started.html
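To illustrate how the pieces fit together, here is a minimal sketch of a DAG shipped this way; the helper module my_helpers is a made-up name standing in for your own code bundled in plugins.zip, and the example assumes the Airflow 1.10.x import paths that MWAA started out with.

# dags/example_mwaa_dag.py -- sketch only; my_helpers is a hypothetical module
# packaged at the top level of plugins.zip, which MWAA puts on the Python path.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

import my_helpers  # hypothetical module shipped in plugins.zip


def run_task():
    # The DAG file stays thin; the real work lives in the packaged module.
    my_helpers.do_work()


with DAG(
    dag_id="example_mwaa_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_task", python_callable=run_task)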
I know Airflow supports remote logging to S3/GCS/Azure, etc., but is there a way to log to specific folders inside this storage based on some configuration inside the DAGs?
Airflow does not support this feature yet. There is a centralised log folder to be configured in airflow.cfg where all logs get saved, irrespective of the DAG.
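For reference, remote logging in Airflow 1.10 is configured once, globally, under [core] in airflow.cfg (the bucket name below is just a placeholder):

[core]
remote_logging = True
remote_base_log_folder = s3://<my-log-bucket>/airflow/logs
remote_log_conn_id = aws_default

Every task log is written under that single base folder (organized by DAG id, task id, and execution date), so per-DAG target folders cannot be chosen from inside the DAG.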
Do you have any recommendations for the Composer folder/directory structure? The way it should be structured is different from the way our internal Airflow server is set up right now.
Based on the Google documentation (https://cloud.google.com/composer/docs/concepts/cloud-storage):
plugins/: stores your custom plugins, operators, and hooks.
dags/: stores DAGs and any data the web server needs to parse a DAG.
data/: stores the data that tasks produce and use.
This is an example of how I organize my dags folder:
I had trouble before when I put a key.json file in the data/ folder: the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the support files in the dags/ folder.
Would the performance of the scheduler be impacted if I put the supporting files (SQL, keys, schemas) for the DAG in the dags/ folder? Is there a good use case for the data/ folder?
It would be helpful if you could show me an example of how to structure the Composer folders to support multiple projects with different DAGs, plugins, and supporting files.
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate Git repository per project?
Thanks!
The impact on the scheduler should be fairly minimal as long as the files you place in the dags/ folder are not .py files; however, you can also place the files in the plugins/ folder, which is also synced via copy.
I would use top-level folders to separate projects (e.g. dags/projectA/dagA.py), or even separate environments if the projects are large enough.
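As an illustration of that suggestion, one possible bucket layout could look like the following (the project and file names are placeholders, not a prescribed structure):

gs://<composer-bucket>/
    dags/
        projectA/
            dagA.py
            sql/
                load.sql
        projectB/
            dagB.py
    plugins/
        projectA_hooks.py
    data/
        projectA/
        projectB/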
First question:
I had trouble before when I put a key.json file in the data/ folder: the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the support files in the dags/ folder.
You only need to set the correct configuration to read these files from the data/ or plugins/ folders. The difference in the filesystem between running Airflow in Composer and running it locally is that the path to these folders changes.
To help you with that, in another post I describe a solution to find the correct path for these folders. I quote my comment from the post:
"If the path I entered does not work for you, you will need to find the path for your Cloud Composer instance. This is not hard to find. In any DAG you could simply log the sys.path variable and see the path printed."
Second question:
Would the performance of the scheduler be impacted if I put the supporting files (SQL, keys, schemas) for the DAG in the dags/ folder? Is there a good use case for the data/ folder?
Yes, at minimum the scheduler needs to check whether these files are Python scripts or not. It's not really that much, but it does have an impact.
If you solve the issue of reading from the data/ or plugins/ folders, you should move these files to those folders.
Third question:
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate Git repository per project?
If your different projects require different PyPI packages, having separate repositories for each one, and separate Airflow environments too, would be ideal. The reason is that you reduce the risk of running into PyPI package dependency conflicts, and you reduce build times.
On the other hand, if your projects will use the same PyPI packages, I'd suggest keeping everything in a single repository until it clearly becomes worth having every project in its own repo; having everything in a single repo makes deployment easier.
I'm trying to deploy Corda nodes on a Windows server. When saving the CorDapp JAR in the plugins folder, which JAR file do I have to save? Should I generate the JAR using IntelliJ artifacts, or just copy the plugins folder contents from the respective nodes created by the gradlew deployNodes command?
If I've understood your question correctly, the easiest way would be to grab the JAR(s) that are created inside the respective node's plugins folder after you've run deployNodes.
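For example (the node name PartyA and the exact layout are assumptions; older Corda versions use a plugins folder, newer ones a cordapps folder):

gradlew.bat deployNodes
build\nodes\PartyA\plugins\    <- copy the CorDapp JAR(s) from here into the plugins folder of the node on your Windows server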
I am very new to Airflow; I have set everything up according to what is stated on their website. However, I find it very confusing to figure out my DAG folder location. NOTE: in airflow.cfg I configured the DAGs folder as /airflow/dags; within this folder there are two files:
/airflow/dags
---dag1.py
---dag2.py
But when I try to run airflow list_dags, it still shows the DAGs inside the example_dags folder at
/usr/local/lib/python2.7/dist-packages/airflow/example_dags
How can I see the path that airflow list_dags uses, and how do I change it? Any help is appreciated.
There is an airflow.cfg value under the [core] section called load_examples that you can set to false to exclude the example DAGs. I think that should clean up the output you’re seeing from list_dags.
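For example, the relevant airflow.cfg entries could look like this (restart the scheduler and webserver afterwards so the change takes effect):

[core]
dags_folder = /airflow/dags
load_examples = False

airflow list_dags prints the DAGs found under dags_folder, so once the examples are disabled only dag1 and dag2 should remain in the output.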