Composer "Dag Bag size" much larger than number of DAGs - airflow

I have a Cloud Composer environment, version composer-1.17.1-airflow-2.1.2. In this environment I currently have 27 DAGs.
However, when I go to the Monitoring page for the environment, and click on "Show DAG Metrics," the chart for "DAG bag size" shows a size of 137.
Is it cause for concern to see that it isn't equal to the number of DAGs? Could it actually reflect the total number of DAG Runs, or perhaps the number of operators? What influences the size of this metric, and how do I know if the metric is healthy?

As per the Cloud Composer metrics documentation, this metric shows the number of DAGs deployed to the bucket and processed by Airflow at a given time.
DAG bag size
A chart showing the number of DAGs deployed to a Cloud Storage bucket and processed by Airflow at a given time. This can be helpful when analyzing performance bottlenecks. For example, an increased number of DAG deployments may degrade performance due to excessive load.
So in your case the deployed count should be 27, and it is not possible for other Composer instances to share the same bucket. The higher number should therefore come from the DAGs processed/run at a given time.
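If you want to cross-check the metric against what Airflow itself parses, here is a minimal sketch, assuming it runs inside the environment against the default DAGs folder (the same folder the scheduler parses):

from airflow.models import DagBag

# Parse the DAGs folder the same way the scheduler does.
dag_bag = DagBag()
print("DAGs parsed:", len(dag_bag.dags))
print("Import errors:", len(dag_bag.import_errors))

The CLI equivalent is "airflow dags list"; if that also reports 27, the gap is on the metric side rather than in your bucket.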

Related

Airflow - Graph / Tree View Of Previous Dynamic Dag Runs

I have a dynamic dag that creates different tasks (and flows) on each run.
The issue is that the Graph or Tree View always shows only the last run flow (and tasks).
So if I want to see the logs from, let's say, 2 executions ago, I have to go to the log files themselves (because that run's tasks are not present at all in the last run).
Is there a way to see the view for an actual DAG run (not one based on the last dynamic DAG run) somewhere, so that I can see tasks that no longer exist?
Thanks.
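For context, a minimal sketch of the kind of dynamic DAG being described; the item list here is hypothetical and stands in for whatever external source changes between scheduler parses:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

def get_current_items():
    # Hypothetical stand-in for a DB/API lookup that changes over time;
    # when this list changes, tasks generated for older runs disappear
    # from the Graph/Tree view.
    return ["a", "b"]

with DAG("dynamic_example", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    start = DummyOperator(task_id="start")
    for item in get_current_items():
        start >> DummyOperator(task_id="process_" + item)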

Google Cloud Functions minInstances Pricing

Is there any pricing information regarding deploying cloud functions with minInstances greater than 0? Also, when I deployed the function with runWith({ minInstances: 1}), the edit function page at console.cloud.google.com does not reflect the change.
You can find pricing info on the Firebase Pricing Page. Note this detail when figuring out the cost of minimum instances: "kept idle" refers to the idle billable time of instances kept warm using minimum instances.
You can also get a quote when you deploy:
Change your function to use minInstances
Before: export const example = functions.https.onCall( ...
After: export const example = functions.runWith({ minInstances: 1 }).https.onCall( ...
Deploy via firebase deploy --only functions
The command line should prompt you with a quote and confirmation input such as:
functions: The following functions have reserved minimum instances. This will reduce the frequency of cold starts but increases the minimum cost. You will be charged for the memory allocation and a fraction of the CPU allocation of instances while they are idle.
With these options, your minimum bill will be $153.75 in a 30-day month
? Would you like to proceed with deployment? (y/N)
Also, a tip: I use Cloud Scheduler to keep my functions warm, as it's a small fraction of the cost and works just as well for my use case.
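For reference, a warm-up job of that kind can be created from the CLI; the job name, schedule, and URI below are placeholders rather than values from the original setup:

gcloud scheduler jobs create http keep-warm --schedule="every 5 minutes" --uri="https://REGION-PROJECT.cloudfunctions.net/example"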

DAG's task initialization takes time

We have a Composer environment with the configuration details below.
Composer version: composer-1.10.0-airflow-1.10.6
Machine type: n1-standard-4
Disk size (GB): 100
Worker nodes: 6
Python version: 3
worker_concurrency: 32
parallelism: 128
Our problem is that tasks in a DAG take a long time to initialize. For example, a DAG has 3 tasks: Task1 -> Task2 -> Task3. Task1 takes at least 5 minutes to initialize, but once started it completes within seconds. Task2 again takes about 5 minutes to initialize and then executes within seconds, and so on: each task's initialization is slow while the task itself finishes quickly. The DAG is scheduled every 5 minutes, yet completing it takes at least around 10 minutes, which affects our functionality and the execution of the process.
Here is what each of the three tasks does. Task1 gathers basic information, such as storage locations, from configuration files/variables. Task2 checks the storage for new files and, depending on the file, triggers the relevant DAGs. Task3 sends a success email.
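For illustration, a minimal sketch of a DAG shaped like the one described, written with Airflow 2.x import paths for brevity (the environment in the question runs 1.10.6, where the import paths differ); all names and values are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def gather_config(**context):
    # Task1: collect storage locations from configuration files / Variables.
    return {"bucket": "example-bucket"}  # placeholder

def check_new_files(**context):
    # Task2: list the storage location and trigger the relevant DAGs
    # for any new files (the real logic lives in the asker's environment).
    pass

with DAG("file_watcher", start_date=datetime(2021, 1, 1),
         schedule_interval="*/5 * * * *", catchup=False) as dag:
    task1 = PythonOperator(task_id="gather_config", python_callable=gather_config)
    task2 = PythonOperator(task_id="check_new_files", python_callable=check_new_files)
    task3 = EmailOperator(task_id="send_success_email", to="team@example.com",
                          subject="Run succeeded", html_content="Done.")
    task1 >> task2 >> task3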
Also, I noticed that the worker nodes do not split the work among themselves; one worker node's CPU utilization is always high compared to the others, and I do not know the reason for it. Another interesting point: even when no other DAGs are running at the time, this DAG still takes about 10 minutes to execute.
Your help in solving this case is appreciated.
This should be a comment but I don't have the reputation required.
My initial advice is to upgrade your Composer version; 1.10.0 has a few known bugs that are fixed in later versions. Right now the latest version is 1.10.4. This should correct the CPU that stays at 100% (it did in our case). Are there many other DAGs running on your instance?
As I mentioned in the comment, the reason behind the high CPU pressure on that particular GKE node might become more evident after troubleshooting on the Airflow workflow and GKE sides.
It regularly happens that on some Airflow runtime node the compute resources (memory/CPU) run beyond the node's capacity, causing Airflow workloads (Pods) to be evicted and restarted, losing all process state data. Meanwhile the Celery executor, which is responsible for assigning tasks to the Airflow workers, may not even be aware of the worker's bad state or time-out, and takes no action to re-assign the task to another worker.
According to the GCP Composer release notes, the vendor has shipped some essential fixes in the latest composer-1.10.* patches, improving Composer runtime performance and reliability, as #parakeet said in his answer.
You can also refer to the GCP Composer known issues knowledge base to keep track of the current problems and workarounds that the vendor shares with the community.

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problem for 2 months until Sunday. There was a blackout in a system my Airflow tasks depend on, and some tasks were queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get triggered at the proper time, but they are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the 3rd point is the reason why it is not working.
The scheduler and the webserver were working; I restarted the scheduler anyway and still get the same outcome. I also deleted the data in the MySQL database for one job, and it is still not running.
I also saw a couple of posts saying that tasks stop running when depends_on_past is set to true and a previous run failed, so the next one is never executed. I checked that as well, and it is not my case.
Any input would be really appreciated. Any ideas? Thanks.
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). Looking at the Airflow code, the scheduler queries for DAG runs to examine (i.e., whose task instances it will consider running), and this query is limited to that number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were up-for-retry), the scheduler won't consider the other DAG runs, even though those could run fine.
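As a sketch, the limit can be raised either in airflow.cfg or through the matching environment variable; 100 is an arbitrary example value:

[scheduler]
max_dagruns_per_loop_to_schedule = 100

# or, equivalently, as an environment variable:
AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE=100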

Specify parallelism per task?

I know in the cfg I can set the parallelism, but is there a way to do it per task, or at least per dag?
dag1=
task_id: 'download_sftp'
parallelism: 4 #I am fine with downloading multiple files at once
task_id: 'process_dimensions'
parallelism: 1 #I want to make sure the dimensions are processed one at a time to prevent conflicts with my 'serial' keys
task_id: 'process_facts'
parallelism: 4 #It is fine to have multiple tables processed at once since there will be no conflicts
dag2 (separate file)=
task_id: 'bcp_query'
parallelism: 6 #I can query separate BCP commands to download data quickly since it is very small amounts of data
You can create a task pool through the web GUI and limit execution parallelism by assigning specific tasks to that pool.
Please see: https://airflow.apache.org/concepts.html#pools
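A minimal sketch of what that looks like in a DAG file, assuming pools named sftp_pool (4 slots) and dimensions_pool (1 slot) have already been created under Admin -> Pools (the pool names and commands here are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("pool_example", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    # At most 4 concurrent task instances across all tasks using sftp_pool.
    download = BashOperator(task_id="download_sftp",
                            bash_command="echo download",
                            pool="sftp_pool")
    # Serialized by the single slot in dimensions_pool.
    process_dims = BashOperator(task_id="process_dimensions",
                                bash_command="echo dimensions",
                                pool="dimensions_pool")
    download >> process_dims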
The number of active DAG runs can be controlled with the parameter below (present in the airflow.cfg configuration file); it applies globally.
By default it is set to 16; changing it to 1 ensures only one instance of a DAG runs at a time while the rest get queued.
#The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
How to limit Airflow to run only 1 DAG run at a time? --> suggests how to control concurrency per DAG.
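Per DAG, the same cap can also be set directly in the DAG definition; a minimal sketch (note that in newer Airflow versions the concurrency argument is renamed max_active_tasks):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG("serial_dag", start_date=datetime(2021, 1, 1),
         schedule_interval="@hourly",
         max_active_runs=1,  # only one run of this DAG at a time
         concurrency=4,      # at most 4 of its task instances at once
         catchup=False) as dag:
    DummyOperator(task_id="only_task")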
