Possibly a dumb question, but when I read the Airflow architecture docs for 2.4.2, there was no mention of the Triggerer. The only pieces were the Metastore, Webserver, and Scheduler (the executor is part of the scheduler).
Having said that, do we still need a Triggerer pod in Airflow 2.4.2 if the deployment is on EKS and the executor is KubernetesExecutor?
What does the Triggerer pod do here?
Thanks
It’s needed for deferrable operators, which are a fairly advanced feature if you’re just getting started. A deferrable operator frees up its worker slot when it makes an API call that takes a while to complete, and marks the task as deferred, awaiting a trigger. The triggerer detects when that trigger completes, and the scheduler then picks the task back up and assigns it to a worker again.
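To make that concrete, here is a minimal sketch (not from the original post) of a deferrable task, assuming Airflow 2.2+ and its built-in TimeDeltaSensorAsync; the DAG name and timing are made up for illustration:

```python
# Minimal sketch of a deferrable task. While it waits, the task releases its
# worker slot and the wait is handled by the triggerer; without a Triggerer
# pod running, deferred tasks like this one would never be resumed.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="deferrable_example",        # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Defers to the triggerer instead of occupying a worker slot for 30 minutes.
    wait = TimeDeltaSensorAsync(
        task_id="wait_30_minutes",
        delta=timedelta(minutes=30),
    )
```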
I am new to Airflow and I have picked up a few bad practices, including marking a task as failed while it is running; Airflow then leaves that task behind as a ghost running job and I cannot find a way to kill it. I have seen the on_kill() option, but I am not really sure how to implement it.
More specifically, I have a task called get_proxies that starts some servers (via SSH) to obtain a list of proxies to be used in the pipeline; it usually takes a few minutes to finish and then the pipeline continues. At the end of the pipeline I have a task called destroy_proxies (also via SSH), which, as its name suggests, destroys the servers where the proxies are running and then stops the containers (docker-compose down -d). Its trigger rule is all_done, so even if the pipeline fails, it destroys the proxies.
Last time, while I was doing some tests and the get_proxies task was running, I decided to manually mark it as Failed. The destroy_proxies task ran successfully a few seconds later and, of course, destroyed the proxies that had been created so far. However, I noticed afterwards that the proxy servers were still running, so I need a way to handle these crashes or manually marked failed/success tasks, because they leave running jobs behind in the background and there is no way for me to stop or kill them (I already have a workaround to destroy such proxies, but still).
More specifically, I have two questions: 1) How do I kill those running jobs? 2) How do I handle tasks being killed externally so that they do not leave behind ghost jobs that are hard to access?
Some information about my Airflow version:
Apache Airflow
version | 2.3.2
executor | CeleryExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
dags_folder | /opt/airflow/airflow-app/dags
plugins_folder | /opt/airflow/plugins
base_log_folder | /opt/airflow/logs
remote_base_log_folder |
System info
OS | Linux
architecture | x86_64
locale | ('en_US', 'UTF-8')
python_version | 3.9.13 (main, May 28 2022, 14:03:04) [GCC 10.2.1 20210110]
python_location | /usr/local/bin/python
For the last task, destroy_proxies, which is an SSHOperator, I was thinking of adding some wait time to the command being executed, but that is not a great solution since the get_proxies task does not always take the same amount of time.
On the other hand (maybe this is worth a separate question), sometimes when I trigger the DAG manually I get some failed tasks with no logs. I suspect this is related, since the extra jobs running in the background might be causing memory issues.
It is my first time writing a question here, so I hope I am being clear; if more information is needed, or anything I wrote is unclear, I am happy to rewrite it and provide more details. Thank you all!
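As a starting point for the on_kill() part of the question, here is a hedged sketch of a custom operator that tears down the remote servers when the task is stopped externally (for example, marked failed in the UI). The class name, connection id, and commands are illustrative placeholders, not the actual pipeline code:

```python
# Hedged sketch: a custom operator whose on_kill() cleans up remote state so
# that externally killed tasks do not leave ghost servers behind.
from airflow.models.baseoperator import BaseOperator
from airflow.providers.ssh.hooks.ssh import SSHHook


class GetProxiesOperator(BaseOperator):
    def __init__(self, ssh_conn_id, command, cleanup_command, **kwargs):
        super().__init__(**kwargs)
        self.ssh_conn_id = ssh_conn_id
        self.command = command                  # e.g. the script that starts the proxy servers
        self.cleanup_command = cleanup_command  # e.g. the script that destroys them

    def execute(self, context):
        hook = SSHHook(ssh_conn_id=self.ssh_conn_id)
        with hook.get_conn() as client:
            _, stdout, _ = client.exec_command(self.command)
            return stdout.read().decode()

    def on_kill(self):
        # Airflow calls on_kill() when the running task is killed externally,
        # so the remote cleanup belongs here.
        with SSHHook(ssh_conn_id=self.ssh_conn_id).get_conn() as client:
            client.exec_command(self.cleanup_command)
```

This does not kill jobs that were already orphaned, but it should prevent new ghost jobs from being left behind when a task is marked failed while running.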
Every time our team uploads a new requirements.txt file for our MWAA environment, it requires a restart.
Even while the environment is in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGs. I would expect something to be unavailable or locked during this process from a user's perspective.
So my questions are: in the MWAA way of things, what exactly is being "restarted" during this process, and why is it applied to the entire so-called MWAA environment?
The Airflow DAG processor, Airflow workers, and Airflow scheduler are rebooted,
but not the Airflow web server.
This can be confirmed by checking their respective logs.
Beware: long-running tasks can fail during a reboot.
I have a Spring Boot application with an accessible /health endpoint, deployed in AWS ECS Fargate. Sometimes the container is stopped with a "Task failed container health checks" message. Sometimes it happens once a day, sometimes once a week; it may depend on the load. This is the health check command specified in the Task Definition:
CMD-SHELL,curl -f http://localhost/actuator/health || exit 1
My question is: how can I troubleshoot what AWS receives when the health check fails?
In case anyone else lands here because of failing container health checks (not the same as ELB health checks), AWS provides some basic advice:
Check that the command works from inside the container. In my case I had not installed curl in the container image, but when I tested it from outside the container it worked fine, which fooled me into thinking it was working.
Check the task logs in CloudWatch
If the checks are only failing sometimes (especially under load), you can try increasing the timeout, but also check the task metrics (memory and CPU usage). Garbage collection can cause the task to pause, and if all the vCPUs are busy handling other requests, the health check may be delayed, so you may need to allocate more memory and/or vCPUs to the task.
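For reference, the knobs mentioned above live in the container's healthCheck block of the task definition. Here is a hedged sketch, with illustrative names and values rather than the poster's actual setup, of registering a revision with a more generous timeout and start period using boto3:

```python
# Hedged sketch: register a new task definition revision with relaxed health
# check settings. Family, image, and sizing values are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="spring-boot-app",                 # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    containerDefinitions=[
        {
            "name": "app",
            "image": "my-registry/spring-boot-app:latest",  # placeholder image
            "essential": True,
            "healthCheck": {
                "command": ["CMD-SHELL", "curl -f http://localhost/actuator/health || exit 1"],
                "interval": 30,      # seconds between checks
                "timeout": 10,       # more headroom than the 5-second default
                "retries": 3,
                "startPeriod": 120,  # grace period while Spring Boot starts up
            },
        }
    ],
)
```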
Thanks @John Velonis,
I don't have enough reputation to comment on your answer, so I'm posting this as a separate answer.
In my case, the ECS container kept getting an UNKNOWN health status from the ECS cluster even though I could access the health check endpoint successfully. When I read this post and checked my base image, node:14.19.1-alpine3.14, I realized it doesn't include the curl command,
so I had to install it in the Dockerfile:
RUN apk --no-cache add curl
We have a kubernetes pod operator that will spit out a python dictionary that will define which further downstream kubernetes pod operators to run along with their dependencies and the environment variables to pass into each operator.
How do I get this python dictionary object back into the executor's context (or is it the worker's context?) so that Airflow can spawn the downstream kubernetes operators?
I've looked at BranchOperator, TriggerDagRunOperator, XCom push/pull, and Variable.get/Variable.set, but nothing seems to quite work.
"We have a kubernetes pod operator that will spit out a python dictionary that will define which further downstream kubernetes pod operators to run"
This is possible, albeit not in the way you are trying. You'll have to have all possible KubernetesPodOperators already declared in your workflow and then skip the ones that need not run.
An elegant way to do this would be to attach a ShortCircuitOperator before each KubernetesPodOperator that reads the XCom (dictionary) published by the upstream KubernetesPodOperator and determines whether or not to continue with the downstream task.
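A hedged sketch of that gating pattern, assuming Airflow 2.x import paths, an upstream KubernetesPodOperator created with do_xcom_push=True, and made-up task ids and dictionary keys:

```python
# One ShortCircuitOperator per pre-declared KubernetesPodOperator; when the
# callable returns False, Airflow skips the downstream pod task.
from airflow.operators.python import ShortCircuitOperator


def should_run(task_name, **context):
    # Pull the dictionary the upstream pod pushed to XCom and decide whether
    # this particular downstream pod should run.
    plan = context["ti"].xcom_pull(task_ids="generate_plan") or {}
    return task_name in plan.get("tasks_to_run", [])


gate_task_b = ShortCircuitOperator(
    task_id="gate_task_b",
    python_callable=should_run,
    op_kwargs={"task_name": "task_b"},
)
# generate_plan >> gate_task_b >> task_b  (task_b being the pre-declared KubernetesPodOperator)
```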
EDIT-1
Actually, a cleaner way would be to just raise an AirflowSkipException within the task that you want to skip (rather than using a separate ShortCircuitOperator to do this).
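Again as a sketch with made-up names: the gate task raises AirflowSkipException instead of returning False, and with the default trigger rule anything downstream of the skipped gate is skipped as well:

```python
# Hedged sketch of the EDIT-1 variant using AirflowSkipException.
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator


def gate(task_name, **context):
    plan = context["ti"].xcom_pull(task_ids="generate_plan") or {}
    if task_name not in plan.get("tasks_to_run", []):
        # Marks this task as skipped (not failed); its downstream
        # KubernetesPodOperator is then skipped too.
        raise AirflowSkipException(f"{task_name} is not in the upstream plan")


gate_task_b = PythonOperator(
    task_id="gate_task_b",
    python_callable=gate,
    op_kwargs={"task_name": "task_b"},
)
```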
"How do I get this python dictionary ... so that airflow can spawn the downstream kubernetes operators ..."
No. You can't dynamically spawn new tasks based on the output of an upstream task.
Think of it this way: for the scheduler it is imperative to know all the tasks (their task_ids, trigger_rules, priority_weight, etc.) ahead of time so that it can execute them when the right time comes. If tasks were to just keep coming up dynamically, Airflow's scheduler would have to become akin to an operating system scheduler (!). For more details, read the EDIT-1 part of this answer.
I have an Airflow environment running on Cloud Composer (3 n1-standard-1 nodes; image version: composer-1.4.0-airflow-1.10.0; config override: core catchup_by_default=False; PyPI packages: kubernetes==8.0.1).
During a DAG run, a few tasks (all GKEPodOperators) failed due to airflow worker pod eviction. All of these tasks were set to retries=0. One of them was requeued and retried. Why would this happen when the task is set to 0 retries? And why would it only happen to one of the tasks?
"airflow worker pod eviction" means that some pods needed more resources hence some pods were evicted.
To fix this you can use larger machine types or try to reduce the DAGs memory consumption.
Review his document to have a better view.