I'm currently running some disk-hungry tasks in Airflow 1.10.5, which are adding some large files (e.g. tmpnhre0_s8.__wz_cache) to the /tmp directory. As the root partition / is almost full, I would like to make Airflow use a different directory, on a larger partition, to store those temporary files.
How can I configure Airflow to use a different directory for the temporary files?
Same here; I'd like to switch the cache to a Redis backend, but it is still hardcoded: https://github.com/apache/airflow/blob/master/airflow/www/app.py#L122
You can monkey patch it, or fork and contribute the change back to Airflow.
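For the webserver cache specifically, a monkey patch could look roughly like the sketch below. It assumes the cache is created via flask_caching.Cache(app, config={'CACHE_TYPE': 'filesystem', 'CACHE_DIR': '/tmp'}) as in the app.py linked above; the module name airflow_tmp_patch and the target directory /data/airflow_cache are made up, and the patch has to be imported before the webserver app is built (for example from a customized entrypoint). Otherwise, forking is the cleaner route.

# airflow_tmp_patch.py -- hypothetical module; import it before airflow.www.app
# builds the Flask app. Sketch only: it rewrites the hardcoded CACHE_DIR='/tmp'
# that the linked app.py passes to flask_caching.Cache.
import flask_caching

_original_init = flask_caching.Cache.__init__

def _patched_init(self, *args, **kwargs):
    config = kwargs.get("config")
    if isinstance(config, dict) and config.get("CACHE_DIR") == "/tmp":
        # Point the filesystem cache at a directory on the larger partition.
        kwargs["config"] = dict(config, CACHE_DIR="/data/airflow_cache")
    _original_init(self, *args, **kwargs)

flask_caching.Cache.__init__ = _patched_init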
I'm used to running pipelines via AWS Data Pipeline but am getting familiar with Airflow (Cloud Composer).
In data pipelines we would:
Spawn a task runner,
Bootstrap it,
Do work,
Kill the task runner.
I just realized that my airflow runners are not ephemeral. I touched a file in /tmp, did it again in a separate DagRun, then listed /tmp and found two files. I expected only the one I most recently touched.
This seems to mean I need to watch out for how much "stuff" is being stored locally on the runner.
I know the /data folder is a GCS bucket mounted with FUSE, so I'm defaulting to storing a lot of my working files there and moving files from there to final buckets elsewhere, but how do you approach this? What would be "best practice"?
Thanks for the advice.
Cloud Composer currently uses CeleryExecutor, which configures persistent worker processes that handle the execution of task instances. As you have discovered, you can make changes to the filesystems of the Airflow workers (which are Kubernetes pods), and they will indeed persist until the pod is restarted/replaced.
Best-practice-wise, you should treat the local filesystem as ephemeral to the task instance's lifetime, but you shouldn't expect that it will be cleaned up for you. If you have tasks that perform heavy I/O, do that work outside of /home/airflow/gcs, because that path is network-mounted (GCSFUSE); if there is final data you want to persist, write it to the data folder (/home/airflow/gcs/data).
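As a rough illustration of that split (the file names and the callable below are placeholders, not Composer specifics), a task can do its scratch work on the worker's local disk and only copy the final artifact onto the mounted data folder:

import os
import shutil
import tempfile

def heavy_io_task():
    # Scratch work happens on the worker's local disk, not on the GCSFUSE mount,
    # and the temporary directory is removed when the block exits.
    with tempfile.TemporaryDirectory() as scratch:
        result_path = os.path.join(scratch, "result.csv")
        with open(result_path, "w") as out:
            out.write("col_a,col_b\n1,2\n")  # stand-in for the real heavy work
        # Only the final artifact is written to the mounted bucket, so it
        # persists in GCS rather than on the worker's filesystem.
        shutil.copy(result_path, "/home/airflow/gcs/data/result.csv")

In a DAG this would simply be wired up with a PythonOperator pointing at heavy_io_task.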
Do you have any recommendations for the Composer folder/directory structure? The way it should be structured is different from the way our internal Airflow server is set up right now.
Based on the Google documentation (https://cloud.google.com/composer/docs/concepts/cloud-storage):
plugins/: Stores your custom plugins, operators, and hooks.
dags/: Stores DAGs and any data the web server needs to parse a DAG.
data/: Stores the data that tasks produce and use.
This is an example of how I organize my dags folder:
I had trouble before when I put the key.json file in the data/ folder: the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the supporting files in the dags/ folder.
Would the performance of the scheduler be impacted if I put the supporting files (SQL, keys, schemas) for the DAGs in the dags/ folder? Is there a good use case for using the data/ folder?
It would be helpful if you could show me an example of how to structure the Composer folders to support multiple projects with different DAGs, plugins, and supporting files.
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate repository per project?
Thanks!
The impact on the scheduler should be fairly minimal as long as the files you place in the dags folder are not .py files; however, you can also place the files in the plugins folder which is also synced via copy.
I would use top-level folders to separate projects (e.g. dags/projectA/dagA.py), or even separate environments if the projects are large enough.
First question:
I had trouble before when I put the key.json file in the data/ folder: the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the supporting files in the dags/ folder.
You only need to set the correct configuration to read these files from the data/ or plugins/ folders. The difference between running Airflow in Composer and running it locally is that the path to these folders changes.
To help you with that, in another post I describe a solution for finding the correct path to these folders. I quote my comment from that post:
"If the path I entered does not work for you, you will need to find the path for your Cloud Composer instance. This is not hard to find. In any DAG you could simply log the sys.path variable and see the path printed."
Second question:
Would the performance of the scheduler be impacted if I put the supporting files (SQL, keys, schemas) for the DAGs in the dags/ folder? Is there a good use case for using the data/ folder?
Yes, at the very least the scheduler needs to check whether these files are Python scripts or not. It's not much, but it does have an impact.
If you solve the issue of reading from the data or plugins folders, you should move these files to those folders.
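For example, on Cloud Composer the bucket's data/ folder is mounted at /home/airflow/gcs/data, so a supporting SQL file can be read like this (the file name is hypothetical):

import os

# Base path of the mounted data/ folder on Cloud Composer workers;
# adjust it if you are running Airflow somewhere else.
DATA_DIR = "/home/airflow/gcs/data"

def load_query(filename):
    # Read a supporting SQL file that was uploaded to the data/ folder.
    with open(os.path.join(DATA_DIR, filename)) as f:
        return f.read()

# Hypothetical usage inside a DAG or task:
# query = load_query("my_report.sql")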
Third question:
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate repository per project?
If your different projects require different PyPI packages, having separate repositories for each one, and separate Airflow environments too, would be ideal. The reason is that you'll reduce the risk of running into PyPI dependency conflicts and cut build times.
On the other hand, if your projects use the same PyPI packages, I'd suggest keeping everything in a single repository until splitting each project into its own repo becomes worthwhile; having everything in a single repo also makes deployment easier.
I have a couple of development machines that I code my changes on and one production server where I have deployed my Symfony application. Currently my deployment process is tedious and consists of the following workflow:
Determine the files changed in the last commit:
svn log -v -r HEAD
FTP those files to the server as the regular user
As root manually copy those files to their destination and, if required because the file is new, change the owner to the apache user
The local user does not have access to the apache directories which is why I must use root. I'm always worried that something will go wrong either due to a forgotten file during the FTP or the copy to the apache src directory.
I was thinking that instead I should FTP the entire Symfony app/ and src/ directories, along with composer.json, to the server as the regular user, then come up with a script that uses rsync to sync all of the files.
New workflow would be:
FTP app/ src/ composer.json to the server in the local user's project directory
Run the sync script to sync the files
Clear the cache
Is this a good solution or is there something better for Symfony projects?
This question is similar and gives an example of the rsync command, but the pros and cons of this method are not discussed. Ideally I'd like the method that is the most reliable and easiest to set up, preferably without the need to install new software.
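For reference, the sync script I have in mind would be something along these lines (the staging path, web root, and apache user are placeholders, and the cache:clear call assumes a Symfony 2-style app/console):

#!/usr/bin/env python
# Sketch of the sync step; run as root so chown works. All paths are placeholders.
import subprocess

STAGING = "/home/deployer/myapp/"   # where app/, src/ and composer.json were FTP'd
WEB_ROOT = "/var/www/myapp/"        # the copy served by apache
OWNER = "www-data:www-data"         # the apache user/group on this server

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# Mirror the uploaded files into the web root, removing files that were deleted.
for path in ("app/", "src/", "composer.json"):
    run(["rsync", "-rtv", "--delete", STAGING + path, WEB_ROOT + path])

# New files should end up owned by the apache user.
run(["chown", "-R", OWNER, WEB_ROOT])

# Clear the Symfony cache.
run(["php", WEB_ROOT + "app/console", "cache:clear", "--env=prod"])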
Basically any automated solution would be better than rsync or FTP. There are multiple things to do, as you have mentioned: copy files, clear the cache, run migrations, generate assets; the list goes on.
Here you will find a list of potential solutions:
http://symfony.com/doc/current/cookbook/deployment/tools.html#using-build-scripts-and-other-tools
From my experience with Symfony I can recommend Capifony; it takes a while to understand, but it pays off.
We have an app with web and worker nodes. The code for both is in the same Git repository but gets deployed to different autoscaling groups. The problem is that there is only one appspec file, while the deployment scripts (AfterInstall, ApplicationStart, etc.) for the web and worker nodes are different. How would I go about setting up CodeDeploy to deploy both apps and execute different deployment scripts?
(Right now we have an appspec file that just invokes chef recipes that execute different actions based on the role of the node)
I know this question is very old, but I had the same question/issue recently and found an easy way to make it work.
I added two appspec files in the same repository: appspec-staging.yml and appspec-storybook.yml.
I also added two buildspec files, buildspec-staging.yml and buildspec-storybook.yml (AWS CodeBuild lets you specify which buildspec file to use).
The idea is that after the build is done, we rename the environment-specific appspec-xx.yml file to the final appspec.yml, so that in the CodeDeploy stage we have a proper appspec.yml to deploy. The command below is for a Linux environment.
post_build:
  commands:
    - mv appspec-staging.yml appspec.yml
Update: according to an Amazon technical support representative, it is not possible.
They recommend having separate repositories for different environments (prod, staging, dev, etc.) and different apps.
That makes it harder to share code, but it is probably doable.
You can make use of environment variables exposed by the agent in your deployment scripts to identify which deployment group is being deployed.
Here's how you can use them: https://blogs.aws.amazon.com/application-management/post/Tx1PX2XMPLYPULD/Using-CodeDeploy-Environment-Variables
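For instance, a hook script referenced from appspec.yml could branch on DEPLOYMENT_GROUP_NAME; the group names and the service commands below are placeholders:

#!/usr/bin/env python
# Example AfterInstall/ApplicationStart hook; group names and commands are placeholders.
import os
import subprocess

group = os.environ.get("DEPLOYMENT_GROUP_NAME", "")

if group == "web-asg":
    # Web nodes: reload the web server after the new revision is in place.
    subprocess.check_call(["service", "nginx", "reload"])
elif group == "worker-asg":
    # Worker nodes: restart the queue consumers instead.
    subprocess.check_call(["service", "my-workers", "restart"])
else:
    raise SystemExit("Unknown deployment group: " + group)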
Thanks,
Surya.
The way I have gotten around this is to have an appspec.yml.web and an appspec.yml.worker in the root of the project. I then have two jobs in Jenkins, one each corresponding to the worker and the web deployments. Each job renames the appropriate file to appspec.yml and does the bundling to send to CodeDeploy.
As a new user of Fossil, I'm curious whether there are any negative implications of using Fossil to store things like /etc and /usr/local/etc files from Unix-like systems such as FreeBSD and OpenBSD. If I'm doing this for multiple systems, I think I'd create a branch for each hostname to track those files.
Q1: Have you done this? Do you prefer a different VCS to handle the system files?
Q2: Lots of changes have happened in Fossil over the years, and I'm curious whether it's possible to restrict who can merge branches into trunk. From reading earlier threads it wasn't possible, but there are two workarounds:
a) tell people not to merge to trunk
b) have people clone, and have the trunk maintainer pick up changes from their repos
System configuration files stored in /etc, /var, or /usr/local/etc can generally only be edited by the root user. But since root has complete access to the whole system, a mistaken command there can have dire consequences.
For that reason I generally use another location to keep edited configuration files, a directory in my home-directory that I call setup, which is under control of git. Since I have multiple machines running FreeBSD, each machine gets its own subdirectory. There is a special subdirectory of setup called shared for those configuration files that are used on multiple machines. Maintaining multiple copies of identical files in separate repositories or even branches can be a lot of extra work.
My workflow is the following:
Edit a configuration file in my repository.
Copy it to its proper location.
Test the changes. If problems occur, go back to step 1.
Commit the changes to the revision control system. Copy the committed files to their proper location.
Initially I had a shell script (basically a list of install commands) to install the files for me. But I also wanted to see the differences between the working tree and the installed files.
So for my convenience, I wrote a script called deploy to help me with this. It can tell me which files in the repo are different from the installed files and can show me the differences. It can also install files to their proper locations.
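The script itself is tailored to my layout, but the core idea is small enough to sketch; the setup/<hostname> mirroring of absolute paths below is illustrative rather than the exact tool:

#!/usr/bin/env python3
# Rough sketch of a "deploy" helper: compare the repository copies of config
# files with the installed ones, show diffs, and install them. The layout
# (setup/<hostname>/... mirroring absolute paths) is illustrative.
import filecmp
import os
import shutil
import subprocess
import sys

REPO_ROOT = os.path.expanduser("~/setup/" + os.uname().nodename)

def installed_path(repo_file):
    # setup/<host>/etc/rc.conf -> /etc/rc.conf
    return "/" + os.path.relpath(repo_file, REPO_ROOT)

def repo_files():
    for dirpath, _dirs, names in os.walk(REPO_ROOT):
        for name in names:
            yield os.path.join(dirpath, name)

def status():
    # Print repo files that differ from, or are missing at, their installed location.
    for f in repo_files():
        target = installed_path(f)
        if not os.path.exists(target) or not filecmp.cmp(f, target, shallow=False):
            print(target)

def diff():
    for f in repo_files():
        target = installed_path(f)
        if os.path.exists(target):
            subprocess.call(["diff", "-u", target, f])

def install():
    # Must run as root to write into /etc, /var and /usr/local/etc.
    for f in repo_files():
        target = installed_path(f)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(f, target)
        print("installed " + target)

if __name__ == "__main__":
    command = sys.argv[1] if len(sys.argv) > 1 else "status"
    {"status": status, "diff": diff, "install": install}[command]()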