I'm used to running pipelines via AWS data pipelines but getting familiar with Airflow (Cloud Composer).
In data pipelines we would:
Spawn a task runner,
Bootstrap it,
Do work,
Kill the task runner.
I just realized that my airflow runners are not ephemeral. I touched a file in /tmp, did it again in a separate DagRun, then listed /tmp and found two files. I expected only the one I most recently touched.
This seems to mean I need to watch out how much "stuff" is being stored locally on the runner.
I know GCS mounts the /data folder with FUSE so I'm defaulting to storing a lot of my working files there, and moving files from there to final buckets elsewhere, but how do you approach this? What would be "best practice"?
Thanks for the advice.
Cloud Composer currently uses CeleryExecutor, which configures persistent worker processes that handle the execution of task instances. As you have discovered, you can make changes to the filesystems of the Airflow workers (which are Kubernetes pods), and they will indeed persist until the pod is restarted/replaced.
Best-practice wise, you should consider the local filesystem to be ephemeral to the task instance's lifetime, but you shouldn't expect that it will clean up for you. If you have tasks that perform heavy I/O, you should perform them outside of /home/airflow/gcs because that path is network mounted (GCSFUSE), but if there is final data you want to persist, then you should write it to /data.
Related
When handling multiple environments (such as Dev/Staging/Prod etc) having separate (preferably identical) Airflow instances for each of these environments would be the best case scenario.
I'm using the GCP managed Airflow ( GCP Cloud Composer), which is not very cheap to run, and having multiple instances would increase our monthly bill significantly.
So, I'd like to know if anyone has recommendations on using a single Airflow instance to handle multiple environments?
One approach I was considering of was to have separate top-level folders within my dags folder corresponding to each of the environment (i.e. dags/dev, dags/prod etc)
and copy my DAG scripts to the relevant folder through the CI/CD pipeline.
So, within my source code repository if my dag looks like:
airflow_dags/my_dag_A.py
During the CI stage, I could have a build step that creates 2 separate versions of this file:
airflow_dags/DEV/my_dag_A.py
airflow_dags/PROD/my_dag_A.py
I would follow a strict naming convention for naming my DAGs, Airflow Variables etc to reflect the environment name, so that the above build step can automatically rename those accordingly.
I wanted check if this is an approach others may have used? Or are there any better/alternative suggestions?
Please let me know if any additional clarifications are needed.
Thank you in advance for your support. Highly appreciated.
I think it can be a good approach to have a shared environement because it's cost effective.
However if you have a Composer cluster per environment, it's simpler to manage, and it's allows having a better separation.
If you stay on a shared environment, I think you are on the good direction with a separation on the Composer bucket DAG and a folder per environment.
If you use Airflow variables, you also have to deal with environment in addition to the DAGs part.
You can then manage the access to each folder in the bucket.
In my team, we chose another approach.
Cloud Composer uses GKE with autopilot mode and it's more cost effective than the previous version.
It's also easier to manage the environement size of the cluster and play with differents parameters (workers, cpu, webserver...).
In our case, we created a cluster per environment but we have a different configuration per environment (managed by Terraform):
For dev and uat envs, we have a little sizing and an environment size as small
For prod env, we have a higher sizing and an environment size as Medium
It's not perfect but this allows us to have a compromise between cost and separation.
I have an Airflow instance in a local Ubuntu machine. This instance doesn't work very well, so I would like to install it again. The problem is that I can't delete the current instance, because it is used by other people, so I would like to create a new Airflow instance in the same machine to put various dags there.
How could I do it? I created a different virtual environment, but I don't know how to install a second airflow server in that environment, which works in parallel with the current airflow.
Thank you!
use different port for webserver
use different AIRFLOW_HOME variable
use different sql_alchemy_conn (to point to a different database)
copy the deployment you have to start/stop your airflow components.
Depending on your deployment you might somehew record process id of your running airflow (so called pid-files) or have some other way to determine which processes are running. But that is nothing airflow-specific, this is something that is specific for your deployment.
I have a small Node.js / Express app deployed to Heroku.
I'd like to use a lightweight database like NeDB to persist some data. Is it possible to periodically backup / copy a file from Heroku if I used this approach?
File-based databases aren't a good fit for Heroku due to its ephemeral filesystem (bold added):
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted. For example, this occurs any time a dyno is replaced due to application deployment and approximately once a day as part of normal dyno management.
Depending on your use case I recommend using a client-server database (this looks like a good fit here) or something like Amazon S3 for file storage.
There are several R scripts that need to be run periodically. Currently, i am having an EC2 instance where these R scripts are running through Cron jobs. However, this is not cost efficient as the scripts do not run all the time.
I am looking for a service that lets me deploy the R scripts and schedule them, only paying per use. Something like for instance AWS Lambda does.
Note: Rewrite these scripts is not a solution for now, since there are many and I do not have the resources for it.
Any ideas or suggestions about it?
You can containerize your scripts and try to run them on ECS with a cron schedule.
Quick search can give you plenty of examples on dockerizing R scripts, like this.
You can push your resulting images to AWS ECR, which is docker registry, and use images to define ECR tasks: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
After that you can run your tasks on schedule.
This way scripts will only consume compute power while working. It still requires some refactoring in form of containerization, but after you do this once it should scale to all other scripts.
If containerization is still too much work, you can combine EC2 instance scheduler with reserved scheduled instances for savings, but be aware that reserved instances have a lot of limitations if you plan on savings.
What do people find is the best way to distribute code (dags) to airflow webserver / scheduler + workers? I am trying to run celery on a large cluster of workers such that any manual updates are impractical.
I am deploying airflow on docker and using s3fs right now and it is crashing on me constantly and creating weird core.### files. I am exploring other solutions (ie StorageMadeEasy, DropBox, EFS, a cron job to update from git...) but would love a little feedback as I explore solutions.
Also how do people typically make updates to dags and distribute that code? If one uses a share volume like s3fs, every time you update a dag do you restart the scheduler? Is editing the code in place on something like DropBox asking for trouble? Any best practices on how update dags and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the airflow master for the both the DAGS and the PLUGINS folders and mounted this share onto the worker. I had an issue once or twice where the NFS mount point would break for some reason but after re-mounting the jobs continued.
To distribute the DAG and PLUGIN code to the Airflow cluster I just deploy it to the master (I do this by bash script on my local machine which just SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, I also don't deploy while a job is running.
A better way to deploy would be to have GIT on the airflow master server which checks out a branch from a GIT repository (test or master depending on the airflow server?) and then replace the dags and plugins with the ones in the git repository. I'm experimenting with doing deployments like this at the moment with Ansible.