Airflow DAG dependencies

I have an Airflow DAG-1 that runs for approximately a week and a DAG-2 that runs every day for a few hours. When DAG-1 is running I cannot have DAG-2 running because of an API rate limit (also, DAG-2 is supposed to run once DAG-1 is finished).
Suppose DAG-1 is already running; then DAG-2, which is supposed to run every day, fails. Is there a way I can schedule the DAG dependencies the right way?
Is it possible to stop DAG-1 temporarily (while it is running) when DAG-2 is supposed to start, and then run DAG-1 again without manual intervention?

One of the best ways is to use a defined pool.
Let's say you have a pool named "specific_pool" and allocate only one slot to it.
Specify the pool name on your DAG's tasks (instead of the default pool, use the newly created pool). That way you can prevent the two DAGs from running in parallel.
Whenever DAG-1 is running, DAG-2 will not be triggered until the pool is free; conversely, if DAG-2 has taken the pool slot, DAG-1 will not be triggered until DAG-2 is completed.
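As a minimal sketch, assuming a one-slot pool named "specific_pool" has already been created (via the UI or the pools CLI) and that both DAGs point their rate-limited tasks at it; the DAG name, schedule and command below are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily DAG-2; DAG-1 would attach the same pool to its API tasks.
with DAG("dag_2", start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    call_api = BashOperator(
        task_id="call_api",
        bash_command="python /scripts/call_api.py",  # placeholder command
        pool="specific_pool",  # the one-slot pool shared by both DAGs
    )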

Related

MWAA restart functionality for requirements.txt updates

Every time our team puts another requirements.txt file for our MWAA environment, it requires a restart.
Regardless of the environment being in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGS. I expect something to at least be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process, and why is it applied to the entire so-called MWAA environment?
The Airflow DAG processor, Airflow workers and Airflow scheduler are rebooted,
but not the Airflow web server.
This can be confirmed by checking their respective logs.
Beware: long-running tasks can fail during a reboot.

Airflow - How to configure a DAG so that all its tasks run on 1 worker

I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that the tasks could be performed on different workers: the file would be downloaded on the first worker and transformed on another worker, and an error would occur because the file is missing on the second worker. Is it possible to configure the DAG so that all tasks are performed on one worker?
It's bad practice. Even if you find a workaround it will be very unreliable.
In general, if your executor allows this, you can configure tasks to execute on a specific worker type. For example, with the CeleryExecutor you can assign tasks to a specific queue. Assuming there is only 1 worker consuming from that queue, your tasks will be executed on the same worker, but the fact that it's 1 worker doesn't mean it will be the same machine. It depends heavily on the infrastructure that you use. For example: when you restart your machines, do you get the exact same machine back, or is a new one spawned?
I highly advise you: don't go down this road.
To solve your issue, either download the file to shared storage like S3, Google Cloud Storage, etc., so all workers can read the file from the cloud, or combine the download and transform into a single operator so both actions are executed together.
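As a minimal sketch of the single-operator approach (the FTP download and transform helpers below are placeholders for your actual logic):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def download_file_from_ftp(local_path):
    ...  # placeholder: fetch the file over FTP to local_path

def transform_file(local_path):
    ...  # placeholder: read and transform the downloaded file

def download_and_transform():
    # Both steps run inside one task, so they always share the
    # same worker's local filesystem.
    local_path = "/tmp/input_file"
    download_file_from_ftp(local_path)
    transform_file(local_path)

with DAG("ftp_pipeline", start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    download_transform = PythonOperator(
        task_id="download_and_transform",
        python_callable=download_and_transform,
    )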

Is there a way to have 3 sets of worker nodes (groups) for Airflow

We are setting up Airflow for scheduling/orchestration. Currently we have Spark Python loads and non-Spark loads on different servers, and files are pushed to GCP from yet another server. Is there an option to decide which worker nodes the Airflow tasks are submitted to? Currently we are using SSH connections to run all workloads. Our processing is mostly on-prem.
We use the Celery executor model. How do we make sure that a specific task runs on its appropriate node?
Task 1 runs on a non-Spark server (no Spark binaries available).
Task 2 executes a PySpark submit (this server has the Spark binaries).
Task 3 pushes the files created by task 2 from another server/node (only this one has the GCP utilities installed to push the files, for security reasons).
If I create a DAG, is it possible to specify that a task executes on a particular set of worker nodes?
Currently we have a wrapper shell script for each task and make 3 SSH runs to complete these processes. We would like to avoid such wrapper shell scripts and instead use the built-in PythonOperator, SparkSubmitOperator, SparkJdbcOperator and SFTPToGCSOperator, making sure each specific task runs on a specific server or set of worker nodes.
In short, can we have 3 worker node groups and make each task execute on a group of nodes based on the operation?
We can assign a queue to each worker node.
Start each Airflow worker specifying its queue:
airflow worker -q sparkload
airflow worker -q non-sparkload
airflow worker -q gcpload
Then set the matching queue on each task (see the sketch after the link below). A similar thread was found as well:
How can Airflow be used to run distinct tasks of one workflow in separate machines?
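A rough sketch of the DAG side under these assumptions: each operator's queue just has to match the -q value its worker group was started with, the paths and bucket below are placeholders, and the Spark and Google provider packages are assumed to be installed.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator

def prepare():
    print("non-Spark preprocessing")  # placeholder

with DAG("onprem_pipeline", start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    task_1 = PythonOperator(
        task_id="prepare_data",
        python_callable=prepare,
        queue="non-sparkload",  # picked up only by workers started with -q non-sparkload
    )
    task_2 = SparkSubmitOperator(
        task_id="spark_transform",
        application="/jobs/transform.py",  # placeholder PySpark application
        queue="sparkload",                 # picked up only by the Spark-capable workers
    )
    task_3 = SFTPToGCSOperator(
        task_id="push_to_gcs",
        source_path="/data/output/part-*.csv",  # placeholder
        destination_bucket="my-bucket",          # placeholder
        queue="gcpload",                         # picked up only by the GCP-enabled workers
    )
    task_1 >> task_2 >> task_3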

Tasks in Airflow not running in order

I'm using Airflow to automate some machine learning models. Everything is successful, but I have issues with the order of the tasks.
I have 7 tasks running in parallel, and the last two tasks must start when those 7 tasks finish.
When 6 of the tasks finish, the last two start without waiting for the 7th task to finish.
Here's an image of what's happening.
It appears that the trigger_rule of your creation_order_cell_task is incorrect (for the desired behaviour).
To get the behaviour you want, it should be either ALL_SUCCESS (the default) or ALL_DONE.
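For reference, a minimal sketch of setting the rule explicitly inside the DAG (the task id is taken from the question, the callable is a placeholder):

from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

creation_order_cell_task = PythonOperator(
    task_id="creation_order_cell",
    python_callable=lambda: None,          # placeholder
    trigger_rule=TriggerRule.ALL_SUCCESS,  # start only after all 7 upstream tasks succeed
)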

Airflow parallelism

The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I needed to change it, so I need to know: what is the difference between the scheduler's "max_threads" and "parallelism" in airflow.cfg?
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous: if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: despite the name, based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: this one's kind of alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags.
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over CPU usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: the scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by max_threads with a default value of 2. Users should increase this value to a larger one (e.g. the number of CPUs where the scheduler runs, minus 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion, not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. max_threads cannot exceed the CPU count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use Python's multiprocessing library for parallelism.
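To make the mapping concrete, a sketch of how those (older) settings look in airflow.cfg; the numbers are arbitrary examples, and each entry can also be set through the matching AIRFLOW__SECTION__KEY environment variable shown in the comments:

[core]
# max task instances running at once across the whole Airflow installation
# (AIRFLOW__CORE__PARALLELISM)
parallelism = 32
# max task instances running at once within any single DAG
# (AIRFLOW__CORE__DAG_CONCURRENCY)
dag_concurrency = 16

[scheduler]
# number of scheduler processes used to schedule DAGs
# (AIRFLOW__SCHEDULER__MAX_THREADS)
max_threads = 2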
