We have moved to Airflow 1.10.2 to resolve the CPU usage issue, and the good news is that the issue we had is now fixed in our environment. However, we have observed that although the DAG's tasks are submitted and show as running on the Airflow dashboard, they hold off on the actual processing and appear to remain queued for about 60 seconds before execution actually begins. Please note the following about our use case implementation:
The Airflow DAGs are not time dependent, i.e. they are not **scheduled DAGs** but are triggered from Python code.
Airflow v1.10.2 is being used as a single standalone installation [executor = LocalExecutor].
The Python code watches a directory for any file(s) that arrive. For each file it observes, the code triggers the Airflow DAG. Files arrive in bundles, so at any given moment we have scenarios where multiple instances of the same DAG are invoked [code snippet provided below]. Each triggered DAG has a task that calls a Python script to launch a Kubernetes pod, where some file-related processing happens. Please find an excerpt from the DAG code below:
positional_to_ascii = BashOperator(
    task_id="uncompress_the_file",
    bash_command='python3.6 ' + os.path.join(cons.CODE_REPO, 'app/Code/k8Job/create_kubernetes_job.py') + ' POS-PREPROCESSING {{ dag_run.conf["inputfilepath"] }} {{ dag_run.conf["frt_id"]}}',
    execution_timeout=None,
    dag=dag)
Once this task completes, it triggers another DAG, which has a task that processes the output of the previous DAG.
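For context, each external trigger from the file-watcher code boils down to firing airflow trigger_dag once per detected file, passing the file details as the run conf that the templated bash_command above reads. A minimal sketch of that call, assuming Python 3.6 and a placeholder DAG id (the actual watcher script is not shown in this question):

import json
import subprocess

def trigger_dag_for_file(input_file_path, frt_id):
    # Build the run conf that the DAG template above expects
    # (dag_run.conf["inputfilepath"] and dag_run.conf["frt_id"]).
    conf = json.dumps({"inputfilepath": input_file_path, "frt_id": frt_id})
    # "prep_preprocess_dag" is a placeholder DAG id taken from the
    # list_dags output further down; the real watcher may fire a different DAG.
    subprocess.run(
        ["airflow", "trigger_dag", "-c", conf, "prep_preprocess_dag"],
        check=True)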
Please find below a few of our config file parameters that may assist in assessing the root cause:
min_file_process_interval = 60
dag_dir_list_interval = 300
max_threads = 2
dag_concurrency = 16
worker_concurrency = 16
max_active_runs_per_dag = 16
parallelism = 32
sql_alchemy_conn = mysql://airflow:fewfw324$gG#someXserver:3306/airflow
executor = LocalExecutor
The DagBag parsing time is 1.305286. Please also find below the output of the command airflow list_dags -r:
-------------------------------------------------------------------
DagBag loading stats for /root/airflow/dags
-------------------------------------------------------------------
Number of DAGs: 7
Total task number: 23
DagBag parsing time: 1.305286
------------------------------+----------+---------+----------+------------------------------
file | duration | dag_num | task_num | dags
------------------------------+----------+---------+----------+------------------------------
/trigger_cleansing.py | 0.876388 | 1 | 5 | ['trigger_cleansing']
/processing_ebcdic_trigger.py | 0.383038 | 1 | 1 | ['processing_ebcdic_trigger']
/prep_preprocess_dag.py | 0.015474 | 1 | 6 | ['prep_preprocess_dag']
/prep_scale_dag.py | 0.012098 | 1 | 6 | ['dataprep_scale_dag']
/mvp.py | 0.010832 | 1 | 2 | ['dg_a']
/prep_uncompress_dag.py | 0.004142 | 1 | 2 | ['dataprep_unzip_decrypt_dag']
/prep_positional_trigger.py | 0.003314 | 1 | 1 | ['prep_positional_trigger']
------------------------------+----------+---------+----------+------------------------------
Below is the status of the airflow-scheduler service, which shows multiple processes:
systemctl status airflow-scheduler
● airflow-scheduler.service - Airflow scheduler daemon
Loaded: loaded (/etc/systemd/system/airflow-scheduler.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2019-03-09 04:44:29 EST; 33min ago
Main PID: 37409 (airflow)
CGroup: /system.slice/airflow-scheduler.service
├─37409 /usr/bin/python3.6 /bin/airflow scheduler
├─37684 /usr/bin/python3.6 /bin/airflow scheduler
├─37685 /usr/bin/python3.6 /bin/airflow scheduler
├─37686 /usr/bin/python3.6 /bin/airflow scheduler
├─37687 /usr/bin/python3.6 /bin/airflow scheduler
├─37688 /usr/bin/python3.6 /bin/airflow scheduler
├─37689 /usr/bin/python3.6 /bin/airflow scheduler
├─37690 /usr/bin/python3.6 /bin/airflow scheduler
├─37691 /usr/bin/python3.6 /bin/airflow scheduler
├─37692 /usr/bin/python3.6 /bin/airflow scheduler
├─37693 /usr/bin/python3.6 /bin/airflow scheduler
├─37694 /usr/bin/python3.6 /bin/airflow scheduler
├─37695 /usr/bin/python3.6 /bin/airflow scheduler
├─37696 /usr/bin/python3.6 /bin/airflow scheduler
├─37697 /usr/bin/python3.6 /bin/airflow scheduler
├─37699 /usr/bin/python3.6 /bin/airflow scheduler
├─37700 /usr/bin/python3.6 /bin/airflow scheduler
├─37701 /usr/bin/python3.6 /bin/airflow scheduler
├─37702 /usr/bin/python3.6 /bin/airflow scheduler
├─37703 /usr/bin/python3.6 /bin/airflow scheduler
├─37704 /usr/bin/python3.6 /bin/airflow scheduler
├─37705 /usr/bin/python3.6 /bin/airflow scheduler
├─37706 /usr/bin/python3.6 /bin/airflow scheduler
├─37707 /usr/bin/python3.6 /bin/airflow scheduler
├─37708 /usr/bin/python3.6 /bin/airflow scheduler
├─37709 /usr/bin/python3.6 /bin/airflow scheduler
├─37710 /usr/bin/python3.6 /bin/airflow scheduler
├─37712 /usr/bin/python3.6 /bin/airflow scheduler
├─37713 /usr/bin/python3.6 /bin/airflow scheduler
├─37714 /usr/bin/python3.6 /bin/airflow scheduler
├─37715 /usr/bin/python3.6 /bin/airflow scheduler
├─37717 /usr/bin/python3.6 /bin/airflow scheduler
├─37718 /usr/bin/python3.6 /bin/airflow scheduler
└─37722 /usr/bin/python3.6 /bin/airflow scheduler
Now, because several files keep coming in, the DAGs are constantly being fired, and plenty of DAG tasks end up in a waiting state. Strangely, we did not have this issue when we were using v1.9. Please advise.
I realized that in the 'airflow.cfg' file, the value of 'min_file_process_interval' was 60. Setting that to zero resolved the problem I reported here.
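For reference, the fix amounts to changing a single line in the [scheduler] section of airflow.cfg (the same setting listed in the question, just with the new value):

min_file_process_interval = 0

With the previous value of 60, the scheduler could wait up to a minute before re-processing a DAG file and picking up its freshly triggered run, which matches the ~60 second delay described above.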
Related
I have Airflow installed on a Kubernetes cluster. I've created some DAGs, but any time I restart the server, for example by: cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
the server deletes all the DAGs I created. How do I keep the DAGs even across webserver restarts?
I have an Airflow deployment running in a Kubernetes cluster. I'm trying to use the CLI to backfill one of my DAGs by doing the following:
I open a shell to my scheduler node by running the following command: kubectl exec --stdin --tty airflow-worker-0 -- /bin/bash
I then execute the following command to initiate the backfill - airflow dags backfill -s 2021-08-06 -e 2021-08-31 my_dag
It then hangs on the below log entry indefinitely until I terminate the process:
[2022-05-31 13:04:25,682] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags
I then get an error similar to the below, complaining that a random DAG that I don't care about can't be found:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/dags/__pycache__/example_dag-37.pyc'
Is there any way to address this? I don't understand why the CLI has to fill up the DagBag given that I've already told it exactly what DAG I want to execute - why is it then looking for random DAGs in the pycache folder that don't exist?
I would like to ask whether the following Apache Airflow command works or not. Thank you.
OS Version: Ubuntu 20.04.4 LTS
Apache Airflow Version: 2.2.5
Airflow Command:
airflow scheduler --pid /xxx/xxx.pid
Expected Result: process id file will exist in /xxx/xxx.pid
You must run the airflow scheduler command with the daemon option (-D) together with the --pid option.
Run
airflow scheduler -D --pid /xxx/xxx.pid
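Once the daemonized scheduler is up, the PID file should appear at the path given to --pid; for example, using the same placeholder path, cat /xxx/xxx.pid should print the scheduler's process ID.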
I am new to Airflow and tried to run a DAG by starting the airflow webserver and scheduler. After I closed the scheduler and webserver, the Airflow processes are still running.
ps aux | grep airflow shows 2 airflow webserver processes running, and a scheduler running for all DAGs.
I tried running kill $(ps aux | grep airflow | awk '{print $2}') but it did not help.
I don't have sudo permissions and webserver UI access.
If you run Airflow locally and start it with the two commands airflow scheduler and airflow webserver, then those processes will run in the foreground. So, simply hitting Ctrl-C for each of them should terminate them and all their child processes.
If you don't have those two processes running in the foreground, there is another way. Airflow creates files with process IDs of the scheduler and gunicorn server in its home directory (by default ~/airflow/).
Running
kill $(cat ~/airflow/airflow-scheduler.pid)
should terminate the scheduler.
Unfortunately, airflow-webserver.pid contains the PID of the gunicorn server and not the initial Airflow command that started it (which is the parent of the gunicorn process). So, we will first have to find the parent PID of the gunicorn process and then kill the parent process.
Running
kill $(ps -o ppid= -p $(cat ~/airflow/airflow-webserver.pid))
should terminate the webserver.
If simply running kill (i.e., sending SIGTERM) for these processes does not work you can always try sending SIGKILL: kill -9 <pid>. This should definitely kill them.
I'm using apache-airflow 1.10.10.
My use case is: I have 4 DAGs, and after some work each of them triggers a common DAG named "dag_common" with different conf parameters using a BashOperator:
airflow trigger_dag -c {"id":"1"} dag_common
airflow trigger_dag -c {"id":"2"} dag_common
airflow trigger_dag -c {"id":"3"} dag_common
airflow trigger_dag -c {"id":"4"} dag_common
Inside these DAGs I have to wait for the triggered DAG to finish. How can I accomplish this?
Dag1 has to wait until dag_common with conf id=1 finishes.
Is there any way to find all dag runs with specific conf?
It looks like a use case for SubDAGs: Implement dag_common as a subDAG and use SubDagOperator() in those four DAGs to run dag_common.
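A minimal sketch of that layout, with placeholder task and DAG names (the actual dag_common tasks are not shown in the question), could look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {"start_date": datetime(2020, 1, 1)}

def build_common_subdag(parent_dag_id, task_id, conf_id):
    # A subDAG's dag_id must have the form "<parent_dag_id>.<task_id>".
    subdag = DAG(
        dag_id="{}.{}".format(parent_dag_id, task_id),
        default_args=default_args,
        schedule_interval=None)
    # Place the former dag_common tasks here, parameterised by conf_id
    # instead of reading dag_run.conf.
    DummyOperator(task_id="common_work_for_id_{}".format(conf_id), dag=subdag)
    return subdag

with DAG("dag_1", default_args=default_args, schedule_interval=None) as dag_1:
    some_work = DummyOperator(task_id="some_work")

    # The parent DAG blocks on this task until the whole subDAG run has
    # finished, which replaces "wait for dag_common with conf id=1".
    run_common = SubDagOperator(
        task_id="run_dag_common",
        subdag=build_common_subdag("dag_1", "run_dag_common", conf_id="1"))

    some_work >> run_common

The same factory can be reused in the other three DAGs with conf_id "2", "3" and "4"; since each subDAG run belongs to its parent DAG run, there is no longer a need to search dag runs by conf.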