Docker-compose file of Airflow with DaskExecutor - airflow

Can someone provide a docker-compose YAML file for the setup mentioned above? I need it for a project.
I am trying to execute my tasks in parallel across each core of the workers, as that gives the tasks a performance boost.
To achieve this I want to execute my Airflow tasks directly on a Dask cluster. My project requires Airflow to run on Docker, and I couldn't find any docker-compose.yaml file for Airflow with the DaskExecutor.
Dask generally has a scheduler and some workers in its cluster.
Apart from this, I've tried to achieve this task parallelism with the airflow-provider-ray library from the Astronomer registry. I followed this documentation to set it up in Docker, but I am facing OSError: Connection timeout; here Airflow runs in Docker while the Ray cluster runs in my local Python environment.
Secondly, I've tried the same with a Dask cluster. In this setup, Airflow runs in Docker with the CeleryExecutor, and in another Docker setup there is a Dask scheduler, two workers, and a notebook. I am able to connect them, but I keep getting the error ModuleNotFoundError: No module named 'unusual_prefix_2774d32fcb40c2ba2f509980b973518ede2ad0c3_dask_client'.
A solution to any of these problems would be appreciated.
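For reference, a minimal docker-compose sketch for Airflow with the DaskExecutor could look like the following. This is an assumption-heavy sketch, not a tested production file: it assumes an Airflow 2.x version that still ships the DaskExecutor, and a custom image (here called my-airflow-with-dask) built from apache/airflow with pip install "apache-airflow[dask]". The Dask scheduler and workers run the same image and mount the same dags folder, because a ModuleNotFoundError like the one above typically means the Dask workers cannot import the same code/environment as the client.

```yaml
version: "3.8"

x-airflow-common: &airflow-common
  image: my-airflow-with-dask:latest   # assumed custom image, see note above
  environment:
    AIRFLOW__CORE__EXECUTOR: DaskExecutor
    AIRFLOW__DASK__CLUSTER_ADDRESS: tcp://dask-scheduler:8786
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
  volumes:
    - ./dags:/opt/airflow/dags
  depends_on:
    - postgres
    - dask-scheduler

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler

  dask-scheduler:
    image: my-airflow-with-dask:latest   # same image keeps the Python env identical
    entrypoint: dask-scheduler

  dask-worker:
    image: my-airflow-with-dask:latest   # workers need airflow + your DAG deps too
    entrypoint: dask-worker
    command: tcp://dask-scheduler:8786
    volumes:
      - ./dags:/opt/airflow/dags
```

The key pieces are AIRFLOW__CORE__EXECUTOR=DaskExecutor and AIRFLOW__DASK__CLUSTER_ADDRESS pointing at the Dask scheduler service; everything else (image names, database, ports) is an example you would adapt.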

Related

Apache airflow: install two instances in the same local machine

I have an Airflow instance on a local Ubuntu machine. This instance doesn't work very well, so I would like to install it again. The problem is that I can't delete the current instance, because it is used by other people, so I would like to create a new Airflow instance on the same machine and put various DAGs there.
How could I do it? I created a separate virtual environment, but I don't know how to install a second Airflow server in that environment that runs in parallel with the current one.
Thank you!
use a different port for the webserver
use a different AIRFLOW_HOME variable
use a different sql_alchemy_conn (pointing to a different database)
replicate whatever deployment mechanism you already have to start/stop your Airflow components
Depending on your deployment you might somehow record the process IDs of your running Airflow processes (so-called pid files) or have some other way to determine which processes are running. But that is nothing Airflow-specific; it is specific to your deployment.
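Concretely, the second instance boils down to pointing those settings somewhere else before launching the components. A minimal sketch, with example paths, port, and connection string (on older Airflow versions the last setting lives under AIRFLOW__CORE__SQL_ALCHEMY_CONN instead):

```shell
# Settings for a second, independent Airflow instance (values are examples)
export AIRFLOW_HOME=/opt/airflow2                 # separate home, config, and logs
export AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8081   # the first instance keeps 8080
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost/airflow2"
# then, inside the second virtualenv:
#   airflow db init && airflow webserver && airflow scheduler
echo "second instance home: $AIRFLOW_HOME, port: $AIRFLOW__WEBSERVER__WEB_SERVER_PORT"
```

Because the environment variables override airflow.cfg, each shell (or service unit) that exports its own set gets a fully independent instance.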

Unable to create dag with apache airflow

I am running Airflow 2.0, setting up an Airflow DAG for the first time, and following the quick-start tutorials.
After creating and running the .py file, I don't see the DAG created; it is not listed for me.
setting:
airflow.cfg:dags_folder = /Users/vik/src/airflow/dags
my python file is in this folder. There are no errors here.
I am able to see the example DAGs (from the bundled examples).
I did airflow db init
airflow webserver
airflow scheduler
then try to list the dags
I think I am missing something
I don't know exactly how you installed everything, but I highly recommend the Astronomer CLI for simplicity and a quick setup. With that you'll be able to set up a first DAG pretty quickly. There is also a video tutorial that helps you understand how to install and set up everything.
A few things to try:
Make sure the scheduler is running (run airflow scheduler), or try to restart it.
Using the Airflow CLI, run airflow config list and make sure that the loaded config is in fact what you are expecting, check the value of dags_folder.
Try running airflow dags list from the CLI, and check the filepath if your DAG is shown in the results.
If there was an error parsing your DAG, so the scheduler could not load it, you can find the logs in ${AIRFLOW_HOME}/logs/scheduler/${date}/your_dag_id.log
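To check that last point quickly, the log path can be assembled from the defaults (AIRFLOW_HOME falls back to ~/airflow when unset; the exact log layout can vary slightly between Airflow versions). A small sketch:

```shell
# Compute today's scheduler log directory; AIRFLOW_HOME defaults to ~/airflow
AIRFLOW_HOME="${AIRFLOW_HOME:-$HOME/airflow}"
LOG_DIR="$AIRFLOW_HOME/logs/scheduler/$(date +%Y-%m-%d)"
echo "$LOG_DIR"
# then grep these logs for parse errors, e.g.:
#   grep -ri "error" "$LOG_DIR" | head
```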

Airflow in ECS cluster

I have previously used Airflow running on an Ubuntu VM. I was able to log in to the VM via WinSCP and PuTTY to run commands and edit Airflow-related files as required.
But this is the first time I've come across Airflow running on an AWS ECS cluster. I am new to ECS, so I am trying to find the best possible way to:
run commands like "airflow db init", stop/start the webserver and scheduler, etc.
install new Python packages with "pip install"
view and edit Airflow files in the ECS cluster
I was reading about the AWS CLI and the ECS CLI: could they be helpful, or is there any other way that lets me do the actions mentioned above?
Thanks in Advance.
Kind Regards
P.
There are many articles that describe how to run Airflow on ECS:
https://tech.gc.com/apache-airflow-on-aws-ecs/
https://towardsdatascience.com/building-an-etl-pipeline-with-airflow-and-ecs-4f68b7aa3b5b
https://aws.amazon.com/blogs/containers/running-airflow-on-aws-fargate/
many more
[Note: Fargate allows you to run containers via ECS without needing to have EC2 instances. More here if you want additional background on what Fargate is.]
Also, the AWS CLI is a generic CLI that maps all AWS APIs (mostly 1:1). You may consider it for your deployment, but it should not be your starting point.
The ECS CLI is a more abstracted CLI that exposes higher-level constructs and workflows specific to ECS. Note that the ECS CLI has been superseded by AWS Copilot, which is an opinionated CLI to deploy workload patterns on ECS. It's perhaps too opinionated to be able to deploy Airflow.
My suggestion is to go through the blogs above and get inspired.
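One concrete option for the "run commands" part, which the blogs may or may not cover, is ECS Exec, which opens a shell inside a running container. This is a sketch with assumed cluster/task/container names, and it requires ECS Exec to be enabled on the task definition:

```shell
# open an interactive shell inside a running Airflow container on ECS
aws ecs execute-command \
  --cluster airflow-cluster \
  --task 0123456789abcdef0 \
  --container webserver \
  --interactive \
  --command "/bin/bash"
# from there you can run airflow CLI commands, install packages, inspect files, etc.
```

Keep in mind that changes made this way (pip installs, file edits) do not survive a container restart; for anything permanent, bake it into the Docker image and redeploy the task definition.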

Using DockerOperator with CeleryExecutor in airflow DAG

At this time I use the LocalExecutor with Airflow. My DAG uses Docker images via the DockerOperator that comes with Airflow, and for this the Docker images must be present on the PC. If I want to use a distributed executor like the CeleryExecutor or KubernetesExecutor, must the Docker images be present on all the machines that are part of the Celery or Kubernetes cluster?
Regards
Oli
That is correct. Since Airflow workers run tasks locally, you will need the Docker images, or any other resources, available locally on the worker. You could try this link to set up a local Docker registry, which can serve the Docker images and save you the effort of maintaining them manually.
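A minimal sketch of that local-registry approach (image and host names are examples; the registry:2 image and the commands are Docker's standard distribution registry):

```shell
# run a private registry, then tag and push the image your DAG uses
docker run -d -p 5000:5000 --restart=always --name registry registry:2
docker tag my-task-image:latest localhost:5000/my-task-image:latest
docker push localhost:5000/my-task-image:latest
# each Celery/Kubernetes worker then pulls from the shared registry:
#   docker pull <registry-host>:5000/my-task-image:latest
```

The DockerOperator's image argument then references the registry-qualified name, so every worker resolves it from the same place instead of relying on a locally built copy.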

Best way to distribute code to airflow webserver / scheduler + workers and workflow

What do people find is the best way to distribute code (DAGs) to the Airflow webserver/scheduler and workers? I am trying to run Celery on a large cluster of workers, such that any manual updates are impractical.
I am deploying Airflow on Docker and using s3fs right now, and it is crashing on me constantly and creating weird core.### files. I am exploring other solutions (i.e. StorageMadeEasy, DropBox, EFS, a cron job to update from git...) but would love a little feedback as I explore solutions.
Also, how do people typically make updates to DAGs and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a DAG? Is editing the code in place on something like DropBox asking for trouble? Any best practices on how to update DAGs and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the Airflow master for both the DAGS and PLUGINS folders and mounted this share on the worker. I had an issue once or twice where the NFS mount point would break for some reason, but after re-mounting, the jobs continued.
To distribute the DAG and PLUGIN code to the Airflow cluster I just deploy it to the master (I do this by bash script on my local machine which just SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy, I also don't deploy while a job is running.
A better way to deploy would be to have git on the Airflow master server check out a branch from a repository (test or master, depending on the Airflow server) and then replace the DAGs and plugins with the ones in that checkout. I'm experimenting with doing deployments like this with Ansible at the moment.
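As a sketch of that kind of deploy step, with example paths and the actual git pull left as a comment so the script is self-contained, the replace-dags-from-a-checkout part could look like:

```shell
#!/bin/sh
# Replace the live DAGs/plugins with the ones from a git checkout (example paths).
set -eu
REPO=/tmp/example-airflow-repo          # your git checkout
AIRFLOW_HOME=/tmp/example-airflow-home  # the live Airflow installation
# simulate a checkout so the sketch is self-contained; in a real deploy you would:
#   git -C "$REPO" pull origin master
mkdir -p "$REPO/dags" "$REPO/plugins" "$AIRFLOW_HOME"
echo "# example dag file" > "$REPO/dags/example_dag.py"
# swap: remove the old folders and copy in the fresh ones
rm -rf "$AIRFLOW_HOME/dags" "$AIRFLOW_HOME/plugins"
cp -r "$REPO/dags" "$AIRFLOW_HOME/dags"
cp -r "$REPO/plugins" "$AIRFLOW_HOME/plugins"
ls "$AIRFLOW_HOME/dags"   # prints: example_dag.py
```

With an NFS-shared dags folder, running this on the master is enough for the workers to pick up the new code; restarting the scheduler after a deploy, as described above, is still the safe move.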
