Sharing large intermediate state between Airflow tasks - airflow

We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?

To clarify something: No matter how you setup airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (current is airflow 1.9.0 at time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
Local executor executes the task on the same machine as the scheduler.
Celery Executor just puts tasks in a queue to be worked on the celery workers.
However, the question you are asking does apply to Celery workers. If you use Celery Executor you will probably have multiple celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage, and write the output file name to xcom. Then when you need to input the output from a previous task, you would read the file name from that task's Xcom and process that file.

Change datatype of column key in xcom table of airflow metastore.
Default datatype of key is: blob.
Change it to LONGBLOB. It will help you to store upto 4GB in between intermediate tasks.

Related

Airflow - How to configure that all DAG's tasks run in 1 worker

I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that tasks can be performed on different workers.The file will be downloaded on the first worker and will be transformed on another worker. An error will occur because the file is missing on the second worker. Is it possible to configure the dag that all tasks are performed on one worker?
It's a bad practice. Even if you will find a work around it will be very unreliable.
In general, if your executor allows this - you can configure tasks to execute on a specific worker type. For example in CeleryExecutor you can set tasks to a specific Queue. Assuming there is only 1 worker consuming from that queue then your tasks will be executed on the same worker BUT the fact that it's 1 worker doesn't mean it will be the same machine. It highly depended on the infrastructure that you use. For example: when you restart your machines do you get the exact same machine or new one is spawned?
I highly advise you - don't go down this road.
To solve your issue either download the file to shared disk space like S3, Google cloud storage, etc... then all workers can read the file as it's stored in cloud or combine the download and transform into a single operator thus both actions are executed together.

Where is /tmp folder located in Airflow?

I know that we can share information between tasks by persisting data to /tmp location. Since every task could run on a different machine. How is /tmp from one task is available to the other task in the Airflow?
This sharing you mentioned is only possible if you use LocalExecutor - because all tasks run on the same machine in this case.
If you use Celery/Kubernetes executors, the "/tmp" data share will not work. It might work accidentaly on Celery Executor if the tasks are executed on the same machine, but this is not at all guaranteed. You could potentially put all the tasks that need such sharing on a single machine, in Celery Executor using "queues" - when you create a single celery worker for one queue, all tasks assigned to that queue will be executed on that machine (thus /tmp sharing will work), but this hevily impacts scalability and resilience scenarios.
And also you could have Airlfow Deployment where /tmp is put on a shared filesystem, but this would be terribly inefficient as "/tmp" folder is often used for local caching of files and using a network filesystem there would severly impact performance.

Airflow resource pool usage on DAG-level?

I'm looking at using airflow for scheduling test-cases execution against shared hw in a lab and have some best practice questions on how to use the resource pool concept for a whole DAG-instance instead of just on task level.
Basically a test-case needs (executed as a instance of a test-case DAG (deploy/execute/collect/un-deploy)) certain physical resources and should therefore request them from the different resource pools(modelling the the physical resources) in order to not run into conflicting concurrent usage with other triggered DAG-instances.
My question is if it's possible to define resource usage on DAG-instance level or if it's only possible on task level. If the latter, then would one parallel task claiming the resource during the whole DAG-instance execution be the best way to handle not having to pass the resource claim between all tasks in the DAG? Other alternatives?
Update after questions from Viraj and dlamblin:
Running 1.10.1
Running LocalExecutor
Have verified that I can run parallel DAGS with concurrent tasks
The resources I want to have custom pools for are not worker resources, rather different peripheral hw units such as relays, routers etc that the tasks running in parallel on a the localexecutor should block on if they are occupied(0 custom resource pool instances left) by an/-other task(s)
The Kubernetes Executor allows for certain node type affinity to be configured on the task or dag level. The Celery Executor has a queue concept to select from a worker group with certain resources available to the worker. You're probably not using a Local Executor as your question doesn't quite make sense for that case.

Create and use Connections in Airflow operator at runtime [duplicate]

This question already has answers here:
Is there a way to create/modify connections through Airflow API
(5 answers)
Closed 4 years ago.
Note: This is NOT a duplicate of
Export environment variables at runtime with airflow
Set Airflow Env Vars at Runtime
I have to trigger certain tasks at remote systems from my Airflow DAG. The straight-forward way to achieve this is SSHHook.
The problem is that the remote system is an EMR cluster which is itself created at runtime (by an upstream task) using EmrCreateJobFlowOperator. So while I can get hold of job_flow_id of the launched EMR cluster (using XCOM), what I need is to an ssh_conn_id to be passed to each downstream task.
Looking at the docs and code, it is evident that Airflow will try to look up for this connection (using conn_id) in db and environment variables, so now the problem boils down to being able to set either of these two properties at runtime (from within an operator).
This seems a rather common problem because if this isn't achievable then the utility of EmrCreateJobFlowOperator would be severely hampered; but I haven't come across any example demonstrating it.
Is it possible to create (and also destroy) either of these from within an Airflow operator?
Connection (persisted in Airflow's db)
Environment Variable (should be accessible to all downstream tasks and not just current task as told here)
If not, what are my options?
I'm on
Airflow v1.10
Python 3.6.6
emr-5.15 (can upgrade if required)
Connections come from the ORM
Yes, you can create connections at runtime, even at DAG creation time if you're careful enough. Airflow is completely transparent on its internal models, so you can interact with the underlying SqlAlchemy directly. As exemplified originally in this answer, it's as easy as:
from airflow.models import Connection
from airflow import settings
def create_conn(username, password, host=None):
new_conn = Connection(conn_id=f'{username}_connection',
login=username,
host=host if host else None)
new_conn.set_password(password)
session = settings.Session()
session.add(new_conn)
session.commit()
Where you can, of course, interact with any other extra Connection properties you may require for the EMR connection.
Environment are process-bounded
This is not a limitation of Airflow or Python, but (AFAIK for every major OS) environments are bound to the lifetime of a process. When you export a variable in bash for example, you're simply stating that when you spawn child processes, you want to copy that variable to the child's environment. This means that the parent process can't change the child's environment after its creation and the child can't change the parents environment.
In short, only the process itself can change its environment after it's created. And considering that worker process are Airflow subprocesses, it's hard to control the creation of their environments as well. What you can do is to write the environment variables into a file and intentionally update the current environment with overrides from that file on each task start.
The way you can do this is to create an Airflow task after EmrCreateJobFlowOperator, that uses BashOperator to probably use aws-cli to retrieve the IP Address of the Virtual Machine where you want to run the task and in the same task run airflow cli that creates an SSH connection using that IP address.

Airflow parallelism

the Local Executor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates. I needed to change it. I need to know what is the difference between scheduler's "max_threads" and
"parallelism" in airflow.cfg ?
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with default value of
2. User should increase this value to a larger value(e.g numbers of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion. Not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.

Resources