We are experimenting with Apache Airflow (version 1.10rc2, with Python 2.7) and deploying it to Kubernetes, with the webserver and scheduler in different pods and the database in Cloud SQL, but we have been facing out-of-memory problems with the scheduler pod.
At the moment of the OOM, we were running only 4 example DAGs (approximately 20 tasks). The memory for the pod is 1 GiB. I've seen in other posts that a task might consume approximately 50 MiB of memory when running, and that all task operations happen in memory, nothing is flushed to disk, so that would already give 1 GB.
Is there any rule of thumb we can use to calculate how much memory we would need for the scheduler based on parallel tasks?
Is there any tuning, apart from decreasing the parallelism, that could be done in order to decrease the use of memory in the scheduler itself?
I don't think our use case requires Dask or Celery to scale Airflow horizontally with more worker machines.
Just a few more details about the configuration:
executor = LocalExecutor
parallelism = 10
dag_concurrency = 5
max_active_runs_per_dag = 2
workers = 1
worker_concurrency = 16
min_file_process_interval = 1
min_file_parsing_loop_time = 5
dag_dir_list_interval = 30
The DAGs running at the time were example_bash_operator, example_branch_operator, example_python_operator and one quickDag we have developed.
All of them have only simple tasks/operators: DummyOperators, BranchOperators, BashOperators that just run echo or sleep, and PythonOperators that also just sleep. In total that is approximately 40 tasks, but not all of them run in parallel because some of them are downstream dependencies and so on, and our parallelism is set to 10, with just a single worker as described above, and dag_concurrency is set to 5.
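For reference, our quickDag is roughly of this shape; a sketch with illustrative names, the real one just chains similar echo/sleep tasks:

# Sketch of our quickDag (names and dates are illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="quick_dag",
    start_date=datetime(2018, 8, 1),
    schedule_interval="@daily",
)

start = DummyOperator(task_id="start", dag=dag)
sleep = BashOperator(task_id="sleep", bash_command="sleep 30", dag=dag)
echo = BashOperator(task_id="echo", bash_command="echo done", dag=dag)

start >> sleep >> echo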
I can't see anything abnormal in the Airflow logs, nor in the task logs.
Running just one of these DAGs, Airflow seems to work as expected.
I can see a lot of scheduler processes in the scheduler pod, each one using 0.2% of memory or more:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
461384 airflow 20 0 836700 127212 23908 S 36.5 0.4 0:01.19 /usr/bin/python /usr/bin/airflow scheduler
461397 airflow 20 0 356168 86320 5044 R 14.0 0.3 0:00.42 /usr/bin/python /usr/bin/airflow scheduler
44 airflow 20 0 335920 71700 10600 S 28.9 0.2 403:32.05 /usr/bin/python /usr/bin/airflow scheduler
56 airflow 20 0 330548 59164 3524 S 0.0 0.2 0:00.02
And this is one of the running tasks, using 0.3% of memory:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
462042 airflow 20 0 282632 91120 10544 S 1.7 0.3 0:02.66 /usr/bin/python /usr/bin/airflow run example_bash_operator runme_1 2018-08-29T07:39:48.193735+00:00 --local -sd /usr/lib/python2.7/site-packages/apache_airflow-1.10.0-py2.7.egg/airflow/example_dags/example_bash_operator.py
There isn't really a concise rule of thumb to follow because it can vary so much based on your workflow.
As you've seen, the scheduler forks several child processes. Also, every task (except Dummy) runs in its own process. Depending on the operator and the data it's processing, the amount of memory needed per task can vary wildly.
The parallelism setting directly limits how many tasks run simultaneously across all DAG runs/tasks, which will have the most dramatic effect for you since you are using the LocalExecutor. You can also try setting max_threads under [scheduler] to 1.
So a (very) general rule of thumb, being generous with resources:
[256 MB for the scheduler itself] + ( [parallelism] * (100 MB + [size of the data you'll process]) )
Where the size of the data will depend on whether you load a full dataset or process it in chunks over the course of the task's execution.
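As a worked example with the settings from the question (a sketch; the per-task numbers are rough assumptions, not measurements):

# Rough memory estimate for the scheduler pod using the rule of thumb above
scheduler_base_mb = 256      # the scheduler itself
parallelism = 10             # from the [core] config in the question
per_task_overhead_mb = 100   # baseline per running task process
data_per_task_mb = 0         # echo/sleep tasks hold essentially no data in memory

estimate_mb = scheduler_base_mb + parallelism * (per_task_overhead_mb + data_per_task_mb)
print("rough estimate: %d MB" % estimate_mb)  # ~1256 MB, i.e. a bit more than a 1 GiB pod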
Even if you don't think you'll need to scale your cluster, I would still recommend using the CeleryExecutor, if only to isolate the scheduler and tasks from each other. That way, if your scheduler or a Celery worker dies, it doesn't take the other down. Especially running in k8s: if your scheduler gets a SIGTERM, it will take any running tasks down with it. If you run them in different pods and the scheduler pod restarts, your tasks can finish uninterrupted. And if you have more workers, it would lessen the impact of memory/processing spikes from other tasks.
Related
I am trying to set up an Airflow cluster for my project and I am using the Celery executor. Along with this I am using RabbitMQ as the queueing service and PostgreSQL as the database. For now I have two master nodes and two worker nodes. All the services are up and running, and I was able to configure my master nodes with the Airflow webserver and scheduler. But on my worker nodes I am running into an issue where I get an error:
airflow command error: argument GROUP_OR_COMMAND: celery subcommand works only with CeleryExecutor, CeleryKubernetesExecutor and executors derived from them, your current executor: SequentialExecutor, subclassed from: BaseExecutor, see help above.
I did configure my airflow.cfg properly. I set the executor value to CeleryExecutor (doesn't this mean I have set the executor value?).
My airflow.cfg is as follows:
Note: I am just adding the parts of the config that I think are relevant to the issue.
[celery]
# This section only applies if you are using the CeleryExecutor in
# ``[core]`` section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# ``airflow celery worker`` command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
worker_concurrency = 16
# The maximum and minimum concurrency that will be used when starting workers with the
# ``airflow celery worker`` command (always keep minimum processes, but grow
# to maximum if necessary). Note the value should be max_concurrency,min_concurrency
# Pick these numbers based on resources on worker box and the nature of the task.
# If autoscale option is available, worker_concurrency will be ignored.
# http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-autoscale
# Example: worker_autoscale = 16,12
# worker_autoscale =
# Used to increase the number of tasks that a worker prefetches which can improve performance.
# The number of processes multiplied by worker_prefetch_multiplier is the number of tasks
# that are prefetched by a worker. A value greater than 1 can result in tasks being unnecessarily
# blocked if there are multiple workers and one worker prefetches tasks that sit behind long
# running tasks while another worker has unutilized processes that are unable to process the already
# claimed blocked tasks.
# https://docs.celeryproject.org/en/stable/userguide/optimizing.html#prefetch-limits
worker_prefetch_multiplier = 1
# Specify if remote control of the workers is enabled.
# When using Amazon SQS as the broker, Celery creates lots of ``.*reply-celery-pidbox`` queues. You can
# prevent this by setting this to false. However, with this disabled Flower won't work.
worker_enable_remote_control = true
# Umask that will be used when starting workers with the ``airflow celery worker``
# in daemon mode. This control the file-creation mode mask which determines the initial
# value of file permission bits for newly created files.
worker_umask = 0o077
# The Celery broker URL. Celery supports RabbitMQ, Redis and experimentally
# a sqlalchemy database. Refer to the Celery documentation for more information.
broker_url = amqp://admin:password@{hostname}:5672/
# The Celery result_backend. When a job finishes, it needs to update the
# metadata of the job. Therefore it will post a message on a message bus,
# or insert it into a database (depending of the backend)
# This status is used by the scheduler to update the state of the task
# The use of a database is highly recommended
# http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-result-backend-settings
result_backend = db+postgresql://postgres:airflow@postgres/airflow
# The executor class that airflow should use. Choices include
# ``SequentialExecutor``, ``LocalExecutor``, ``CeleryExecutor``, ``DaskExecutor``,
# ``KubernetesExecutor``, ``CeleryKubernetesExecutor`` or the
# full import path to the class when using a custom executor.
executor = CeleryExecutor
Please let me know if I haven't added sufficient information pertinent to my problem. Thank you.
The reason for the above error could be:
Airflow is picking up the default value of the executor from the [core] section of airflow.cfg (i.e. SequentialExecutor). That value comes from Airflow's default configuration template: when Airflow is imported, it looks for a configuration file at $AIRFLOW_HOME/airflow.cfg, and if it doesn't exist, Airflow falls back to this template.
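One way to confirm which executor the worker process actually resolves is to query Airflow's configuration from inside the container; a minimal sketch:

# Run inside the worker container; prints the executor from the loaded config.
# SequentialExecutor here means the wrong/default airflow.cfg is being picked up.
from airflow.configuration import conf

print(conf.get("core", "executor"))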
The following solution is applicable if you are using the official Helm chart:
Change the default value of the executor in the core section of airflow.cfg.
Snapshot of default configuration
Pass the environment variable AIRFLOW_HOME to the Flower deployment/container. You can simply pass environment variables to all the containers by adding the following to the values file of the Helm chart:
env:
- name: "AIRFLOW_HOME"
value: "/path/to/airflow/home"
In case the airflow user doesn't have access to the path you passed in the AIRFLOW_HOME environment variable, run the Flower container as the root user, which can be done by passing the following config in the values file of the Helm chart:
flower:
enabled: true
securityContext:
runAsUser: 0
I have installed apache-airflow==1.10.8. I have around 200 DAG files inside the AIRFLOW_HOME folder. Each DAG file takes around 20 seconds to execute. I have scheduled each DAG to run every 2 minutes ('*/2 * * * *'). But when I look at the logs of any particular DAG, I see that the DAG is not executed every 2 minutes. Attached are the execution times that I got from the logs directory for one particular DAG:
2020-06-02T10:14:00+00:00
2020-06-02T10:24:00+00:00
2020-06-02T10:34:00+00:00
2020-06-02T10:44:00+00:00
2020-06-02T11:14:00+00:00
2020-06-02T11:24:00+00:00
Following are the configurations in airflow.cfg
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
dagbag_import_timeout = 30
dag_file_processor_timeout = 50
task_runner = StandardTaskRunner
How can I make Airflow execute the DAGs every 2 minutes?
Additional details: Ubuntu 18.04 and Python 3.7
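For completeness, each DAG file follows roughly this pattern (a sketch; the DAG id and task names here are illustrative):

# Sketch of one of the ~200 DAG files; each run takes about 20 seconds.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="sample_dag_001",
    start_date=datetime(2020, 6, 1),
    schedule_interval="*/2 * * * *",  # every 2 minutes
    catchup=False,
)

work = BashOperator(task_id="work", bash_command="sleep 20", dag=dag)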
Running an Airflow (v1.10.5) DAG that ran fine with the SequentialExecutor, many (though not all) simple tasks now fail without any log information when running with the LocalExecutor and minimal parallelism, e.g.
<airflow.cfg>
# overall task concurrency limit for airflow
parallelism = 8 # the same as the number of cores shown by lscpu
# max tasks per dag
dag_concurrency = 2
# max instances of a given dag that can run on airflow
max_active_runs_per_dag = 1
# max threads used per worker / core
max_threads = 2
# 40G of RAM available total
# CPUs: 8 (sockets 4, cores per socket 4)
see https://www.astronomer.io/guides/airflow-scaling-workers/
Looking at the airflow-webserver.* logs nothing looks out of the ordinary, but looking at airflow-scheduler.out I see...
[airflow#airflowetl airflow]$ tail -n 20 airflow-scheduler.out
....
[2019-12-18 11:29:17,773] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table1 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,779] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table2 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,782] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table3 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:18,833] {scheduler_job.py:832} WARNING - Set 1 task instances to state=None as their associated DagRun was not in RUNNING state
[2019-12-18 11:29:18,844] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table4 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status success for try_number 1
....
but not really sure what to take away from this.
Anyone know what could be going on here or how to get more helpful debugging info?
Looking again at my lscpu specs, I noticed...
[airflow#airflowetl airflow]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Notice Thread(s) per core: 1
Looking at my airflow.cfg settings I see max_threads = 2. Setting max_threads = 1 and restarting the scheduler seems to have fixed the problem.
If anyone knows more about what exactly is going wrong under the hood (e.g. why the task fails rather than just waiting for another thread to become available), I would be interested to hear about it.
Our airflow installation is using CeleryExecutor.
The concurrency configs were
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 16
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 64
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
We have a DAG that executes daily. It has a number of tasks running in parallel that follow a pattern: sense whether the data exists in HDFS, then sleep 10 minutes, and finally upload to S3.
Some of the tasks have been encountering the following error:
2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: example_dag.task1 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config
This kind of error occurs randomly in those tasks. When it happens, the task instance's state is immediately set to up_for_retry, and there are no logs on the worker nodes. After some retries, they eventually execute and finish.
This problem sometimes gives us large ETL delays. Does anyone know how to solve it?
We were facing similar problems, which were resolved by the "-x, --donot_pickle" option.
For more information: https://airflow.apache.org/cli.html#backfill
I was seeing very similar symptoms in my DagRuns. I thought it was due to the ExternalTaskSensor and concurrency issues, given the queuing and killed-task language that looked like this: Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally? But when I looked at the worker logs, I saw there was an error caused by setting a variable with Variable.set in my DAG. The issue is described here: duplicate key value violates unique constraint when adding path variable in airflow dag. The scheduler polls the dagbag at regular intervals to refresh any changes dynamically, and the error on every heartbeat was causing significant ETL delays.
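For illustration, the problematic pattern and the change that fixed it for me looked roughly like this (the DAG/variable names are hypothetical):

# Module-level code in a DAG file runs on every scheduler parse, so a
# Variable.set at the top level executes repeatedly and can hit the
# duplicate-key error mentioned above.
from airflow.models import Variable

# problematic: executed on every dagbag refresh
# Variable.set("redshift_load_path", "s3://bucket/prefix")

# better: defer the write to task execution time via a PythonOperator callable
def set_load_path(**context):
    Variable.set("redshift_load_path", "s3://bucket/prefix")  # illustrative value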
Are you performing any logic in your wh_hdfs_to_s3 DAG (or others) that might be causing errors or delays / these symptoms?
We fixed this already. Let me answer my own question:
We have 5 Airflow worker nodes. After installing Flower to monitor the tasks distributed to these nodes, we found that the failing tasks were always sent to one specific node. We used the airflow test command to run the tasks on other nodes and they worked. Eventually, the cause turned out to be a wrong Python package on that specific node.
One of my Cassandra cluster nodes shows the following result when I run 'top':
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13351 root 20 0 20776 1476 324 R 100 0.0 2646:32 whiptail
I have not started this "whiptail" (I don't even know what it is), yet it has somehow started, is consuming 100% of my CPU, and is making my node unreachable.
How can I get rid of it? If I kill it, will it make the system unstable or cause any harm?
Whiptail is a tool for displaying dialog boxes from shell scripts and is often used in interactive scripts. Also, there are several packages that use it:
alsa-utils
signing-party
rcconf
module-assistant
modconf
gkdebconf
ubuntu-minimal
psfontmgr
pppoeconf
pppconfig
gdm
friendly-recovery
defoma
debian-goodies
debconf
Did you make any recent changes to the system? You can kill it using kill -9 PID. If some package is using it, it will kick back in, which might help you identify the root cause.