Flask + Celery: get real-time task status [duplicate]

How does one check whether a task is running in celery (specifically, I'm using celery-django)?
I've read the documentation, and I've googled, but I can't see a call like:
my_example_task.state() == RUNNING
My use-case is that I have an external (java) service for transcoding. When I send a document to be transcoded, I want to check if the task that runs that service is running, and if not, to (re)start it.
I'm using the current stable versions - 2.4, I believe.

Return the task_id (which is given from .delay()) and ask the celery instance afterwards about the state:
x = method.delay(1, 2)
print(x.task_id)
When asking, get a new AsyncResult using this task_id:
from celery.result import AsyncResult
res = AsyncResult("your-task-id")
res.ready()

Creating an AsyncResult object from the task id is the way recommended in the FAQ to obtain the task status when the only thing you have is the task id.
However, as of Celery 3.x, there are significant caveats that could bite people if they do not pay attention to them. It really depends on the specific use-case scenario.
By default, Celery does not record a "running" state.
In order for Celery to record that a task is running, you must set task_track_started to True. Here is a simple task that tests this:
@app.task(bind=True)
def test(self):
    print(self.AsyncResult(self.request.id).state)
When task_track_started is False, which is the default, the state shown is PENDING even though the task has started. If you set task_track_started to True, then the state will be STARTED.
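For reference, a minimal sketch of enabling this on a Celery 4.x app (the broker/backend URLs here are hypothetical, just to make the snippet self-contained):
from celery import Celery

# Hypothetical broker/backend URLs for illustration only.
app = Celery("proj", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

app.conf.task_track_started = True  # record STARTED instead of leaving tasks at PENDING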
The state PENDING means "I don't know."
An AsyncResult with the state PENDING does not mean anything more than that Celery does not know the status of the task. This could be because of any number of reasons.
For one thing, AsyncResult can be constructed with invalid task ids. Such "tasks" will be deemed pending by Celery:
>>> task.AsyncResult("invalid").status
'PENDING'
Ok, so nobody is going to feed obviously invalid ids to AsyncResult. Fair enough, but it also has the effect that AsyncResult will consider a task that has successfully run, but that Celery has forgotten, as being PENDING. Again, in some use-case scenarios this can be a problem.
Part of the issue hinges on how Celery is configured to keep the results of tasks, because it depends on the availability of the "tombstones" in the results backend. ("Tombstones" is the term used in the Celery documentation for the data chunks that record how the task ended.) Using AsyncResult won't work at all if task_ignore_result is True. A more vexing problem is that Celery expires the tombstones by default. The result_expires setting defaults to 24 hours. So if you launch a task, record its id in long-term storage, and more than 24 hours later create an AsyncResult from it, the status will be PENDING.
All "real tasks" start in the PENDING state. So getting PENDING on a task could mean that the task was requested but never progressed further than this (for whatever reason). Or it could mean the task ran but Celery forgot its state.
Ouch! AsyncResult won't work for me. What else can I do?
I prefer to keep track of goals rather than keep track of the tasks themselves. I do keep some task information, but it is really secondary to keeping track of the goals. The goals are stored in storage independent from Celery. When a request needs to perform a computation that depends on some goal having been achieved, it checks whether the goal has already been achieved; if yes, it uses this cached goal, otherwise it starts the task that will effect the goal, and sends the client that made the HTTP request a response indicating that it should wait for a result.
The variable names and hyperlinks above are for Celery 4.x. In 3.x the corresponding variables and hyperlinks are: CELERY_TRACK_STARTED, CELERY_IGNORE_RESULT, CELERY_TASK_RESULT_EXPIRES.
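As a rough, hypothetical sketch of the goal-tracking approach described above (the in-memory dict stands in for whatever storage you use outside Celery, and transcode_task stands in for your Celery task):
GOALS = {}  # document_id -> result; stand-in for storage independent from Celery

def handle_transcode_request(document_id, transcode_task):
    if document_id in GOALS:
        # Goal already achieved: serve the cached result, no Celery lookup needed.
        return {"status": "done", "result": GOALS[document_id]}
    # Otherwise start (or restart) the work and tell the client to wait/poll.
    transcode_task.delay(document_id)
    return {"status": "pending"}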

Every Task object has a .request property, which contains information about the current request. Accordingly, the following line gives the state of a task object named task:
task.AsyncResult(task.request.id).state

You can also create custom states and update their value during task execution.
This example is from the docs:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
http://celery.readthedocs.org/en/latest/userguide/tasks.html#custom-states
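On the consuming side, a sketch of reading that custom state back (it assumes a configured result backend and that you already know the task id):
from celery.result import AsyncResult

def describe_progress(task_id, app):
    res = AsyncResult(task_id, app=app)
    if res.state == 'PROGRESS':
        # The meta dict passed to update_state() is exposed as res.info.
        return "{current}/{total} files uploaded".format(**res.info)
    return res.state  # PENDING, STARTED, SUCCESS, FAILURE, ...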

Old question but I recently ran into this problem.
If you're trying to get the task_id you can do it like this:
import celery
from celery_app import add
from celery import uuid
task_id = uuid()
result = add.apply_async((2, 2), task_id=task_id)
Now you know exactly what the task_id is and can now use it to get the AsyncResult:
# grab the AsyncResult
result = celery.result.AsyncResult(task_id)

# print the task id
print(result.task_id)
# 09dad9cf-c9fa-4aee-933f-ff54dae39bdf

# print the AsyncResult's status
print(result.status)
# SUCCESS

# print the result returned
print(result.result)
# 4

Just use this API from the Celery FAQ:
result = app.AsyncResult(task_id)
This works fine.

Answer of 2020:
#### tasks.py
@celery.task()
def mytask(arg1):
    print(arg1)

#### blueprint.py
@bp.route("/args/arg1=<arg1>")
def sleeper(arg1):
    process = mytask.apply_async(args=(arg1,))  # or: mytask.delay(arg1)
    state = process.state
    return f"Thanks for your patience, your job {process.task_id} \
             is being processed. Status {state}"
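A companion status endpoint for the blueprint above could look like this (a sketch; it assumes the same bp blueprint, and the import path for the Celery app object, here called celery_app, is hypothetical):
from celery.result import AsyncResult
from myapp.celery_app import celery_app  # hypothetical import path

@bp.route("/status/<task_id>")
def task_status(task_id):
    res = AsyncResult(task_id, app=celery_app)
    payload = {"task_id": task_id, "state": res.state}
    if res.ready():
        payload["result"] = str(res.result)  # str() because a FAILURE result is an exception
    return payload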

Try:
task.AsyncResult(task.request.id).state
this will give you the Celery task status. If the Celery task is already in the FAILURE state, it will throw an exception:
raised unexpected: KeyError('exc_type',)

I found helpful information in the Celery Project Workers Guide, under inspecting-workers.
For my case, I am checking to see if Celery is running.
inspect_workers = task.app.control.inspect()
if inspect_workers.registered() is None:
    state = 'FAILURE'
else:
    state = str(task.state)
You can play with inspect to fit your needs.

First, in your Celery app:
vi my_celery_apps/app1.py
app = Celery(worker_name)
Next, in your task file, import app from your Celery app module:
vi tasks/task1.py
from my_celery_apps.app1 import app

task = app.AsyncResult(task_id)
try:
    if task.state.lower() != "success":
        return
except Exception:
    """ do something """

res = method.delay()
print(f"id={res.id}, state={res.state}, status={res.status} ")
print(res.get())

For simple tasks, we can use http://flower.readthedocs.io/en/latest/screenshots.html and http://policystat.github.io/jobtastic/ to do the monitoring.
For complicated tasks, say a task that deals with a lot of other modules, we recommend manually recording the progress and message on the specific task unit.

Apart from the programmatic approaches above, task status can easily be seen using Flower.
Real-time monitoring using Celery Events.
Flower is a web based tool for monitoring and administrating Celery clusters.
Task progress and history
Ability to show task details (arguments, start time, runtime, and more)
Graphs and statistics
Official Document:
Flower - Celery monitoring tool
Installation:
$ pip install flower
Usage:
Run Flower against your Celery app, then open http://localhost:5555 in a browser.
Update:
This has a versioning issue: flower 0.9.7 works only with celery 4.4.7. Moreover, when you install flower, it downgrades your higher version of celery to 4.4.7, and this never works for registered tasks.

Related

Custom Operator States (queued, success, etc.) in Apache Airflow?

In Apache Airflow (2.x), each Operator Instance has a state as defined here (airflow source repo).
I have two use cases that don't seem to clearly fall into the pre-defined states:
Warn, but don't fail - This seems like it should be a very standard use case and I am surprised to not see it in the out-of-the-box airflow source code. Basically, I'd like to color-code a node with something eye-catching - say orange - corresponding to a non-fatal warning, but continue execution as normal otherwise. Obviously you can print warnings to the log, but finding them takes more work than just looking at the colorful circles on the DAGs page.
"Sensor N/A" or "Data not ready" - This would be a status that gets assigned when a sensor notices that data in the source system is not yet ready, and that downstream operators can be skipped until the next execution of the DAG, but that nothing in the data pipeline is broken. Basically an expected end-of-branch.
Is there a good way of achieving either of these use cases with the out-of-the-box Airflow node states? If not, is there a way to defining custom operator states? Since I am running airflow on a managed service (MWAA), I don't think changing the source code of our deployment is an option.
Thanks,
The task states are tightly integrated with Airflow. There's no way to configure which logging levels lead to which state. I'd say the easiest way is to grep log files for "WARNING" or set up a log aggregation service e.g. Elasticsearch to make log files searchable.
For #2, sensors have no knowledge about why a sensor timed out. After timeout or execution_timeout is reached, they simply raise an Exception. You can deal with exceptions using trigger_rules, but these still don't take the reason for an exception into account.
If you want more control over this, I would implement your own Sensor which takes an argument e.g. data_not_ready_timeout (which is smaller than timeout and execution_timeout). In the poke() method, check if data_not_ready_timeout has been reached, and raise an AirflowSkipException if so. This will skip downstream tasks. Once timeout or execution_timeout are reached, the task is failed. Look at BaseSensorOperator.execute() for some inspiration to get the initial starting date of a sensor.
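A minimal sketch of such a sensor, assuming Airflow 2.x and the default poke mode (in reschedule mode the operator instance is re-created on every poke, so the start time would have to come from the task instance instead, as hinted by BaseSensorOperator.execute(); names and the timeout default are illustrative):
from datetime import datetime, timedelta

from airflow.exceptions import AirflowSkipException
from airflow.sensors.base import BaseSensorOperator


class DataReadySensor(BaseSensorOperator):
    """Skips (rather than fails) downstream tasks if the data never becomes ready in time."""

    def __init__(self, data_not_ready_timeout=timedelta(hours=1), **kwargs):
        super().__init__(**kwargs)
        self.data_not_ready_timeout = data_not_ready_timeout
        self._first_poke_at = None

    def poke(self, context):
        if self._first_poke_at is None:
            self._first_poke_at = datetime.utcnow()
        if datetime.utcnow() - self._first_poke_at > self.data_not_ready_timeout:
            raise AirflowSkipException("Source data not ready; skipping this branch.")
        return self._data_is_ready(context)

    def _data_is_ready(self, context):
        # Replace with a real check against the source system.
        return False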

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
|-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
    for filepath in Path(WATCH_DIR).glob("*"):
        sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either Move or TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore.
This gives rise to two problems:
After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows the tasks' runs and states, but clicking on the "square" and picking any details leads to an Airflow error.)
The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?
While it isn't clear, I'm assuming that the downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and always stay there.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with slight modification to your DAGs structure, you can get rid of your troubles (which also are quite advanced IMO)
Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
Make the upstream orchestrator DAG do two things
Sense / wait for files to appear
For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it either ways
have a single task that does file sensing + triggering downstream DAGs for each file
have two tasks (I'd prefer this)
first task senses files and when they appear, publishes their list in an XCOM
second task reads that XCOM and, for each file, triggers its corresponding DAG
but whatever way you choose, you'll have to replicate the relevant bits of code from
FileSensor (to be able to sense file and then publish their names in XCOM) and
TriggerDagRunOperator (so as to be able to trigger multiple DAGs with a single task)
here's a diagram depicting the two-task approach
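A rough sketch of that two-task variant for Airflow 1.10.x (the watch path, DAG ids, and the reduction of the "sensing" step to a plain directory listing are illustrative; trigger_dag is the experimental helper available in 1.10.x):
from pathlib import Path

from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator

WATCH_DIR = "/data/incoming"  # illustrative


def list_files(**context):
    # The returned list is pushed to XCom automatically.
    return [str(p) for p in Path(WATCH_DIR).glob("*")]


def trigger_per_file(**context):
    files = context["ti"].xcom_pull(task_ids="list_files") or []
    for i, filepath in enumerate(files):
        trigger_dag(
            dag_id="process_file",  # the static downstream DAG
            run_id="file_{}_{}".format(i, context["ds_nodash"]),
            conf={"filepath": filepath},
        )


with dag:  # the orchestrator DAG object
    list_task = PythonOperator(task_id="list_files", python_callable=list_files,
                               provide_context=True)
    trigger_task = PythonOperator(task_id="trigger_per_file",
                                  python_callable=trigger_per_file,
                                  provide_context=True)
    list_task >> trigger_task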
The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible". The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does show up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

Airflow Dagrun for each datum instead of scheduled

The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf. The problem with this is that this is not the intended way for Airflow to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 Dagruns per day.
Run a DAG every period for processing all documents created in the collection for that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if a document takes longer than other documents do to be processed by a task, those other documents are waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means that running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow that we may take advantage of is the ease of parametrization. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
import time
from prefect import task, Flow, Parameter
from threading import Thread


def stream():
    for x in range(10):
        yield x
        time.sleep(1)


@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x


@task
def transform(x):
    return x * 2


@task
def load(y):
    print("Received y: {}".format(y))


with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)

for x in stream():
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()
You could change trigger_rule from "all_success" to "all_done"
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
And you could also create a branch that processes failed documents, with trigger_rule set to "one_failed", to handle those failed documents differently (e.g. move them to a "failed" folder and send a notification). See the sketch below.
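A sketch of what that branch could look like (the operator choice, the placeholder callables, and the dag object are illustrative, not part of the original answer):
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

# Placeholder callables; replace with the real processing/notification logic.
def process_document(**kwargs): pass
def summarize_run(**kwargs): pass
def handle_failed_docs(**kwargs): pass

process = PythonOperator(task_id="process_document",
                         python_callable=process_document, dag=dag)

# Runs whether process_document succeeded or failed.
summarize = PythonOperator(task_id="summarize", python_callable=summarize_run,
                           trigger_rule=TriggerRule.ALL_DONE, dag=dag)

# Runs only if an upstream task failed, e.g. to move the document to a
# "failed" folder and send a notification.
handle_failures = PythonOperator(task_id="handle_failures",
                                 python_callable=handle_failed_docs,
                                 trigger_rule=TriggerRule.ONE_FAILED, dag=dag)

process >> [summarize, handle_failures]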
I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Choking points might be
executor - use Celery Executor instead of Local Executor for example
backend database - monitor and tune as necessary (indexes, proper storage etc)
webserver - well, for thousands of dagruns, tasks etc., perhaps only use the webserver for dev/qa environments, and not for production where you have a higher rate of task/dagrun submissions. You could use the CLI instead.
Another approach is scaling out by running multiple Airflow instances - partition documents let's say to ten buckets, and assign each partition's documents to just one Airflow instance.
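A deterministic way to do that bucketing could be hashing each document id (a sketch; the bucket count and the id field are whatever fits your collection):
import hashlib

NUM_BUCKETS = 10  # one bucket per Airflow instance

def bucket_for(document_id):
    # md5 is stable across processes and restarts, unlike Python's built-in hash().
    digest = hashlib.md5(str(document_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS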
I'd process the heavier tasks in parallel and feed successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish before moving downstream, but this would still be far more acceptable than spawning 1 DAG for each record; something along these lines:
Task 1: read mongo filtering by some timestamp (remember idempotence) and feed tasks (i.e. via xcom);
Task 2: do stuff in parallel via PythonOperator, or even better via a K8s Pod, i.e.:
from queue import Queue
import threading

def thread_fun(ret):
    while not job_queue.empty():
        job = job_queue.get()
        try:
            ret.append(stuff_done(job))
        except Exception:
            pass
        job_queue.task_done()
    return ret

# Create workers and queue
threads = []
ret = []  # a mutable object
job_queue = Queue(maxsize=0)
for thr_nr in range(appropriate_thread_nr):
    worker = threading.Thread(
        target=thread_fun,
        args=(ret,)
    )
    worker.setDaemon(True)
    threads.append(worker)
# Populate queue with jobs
for row in xcom_pull(task_ids=upstream_task):
    job_queue.put(row)
# Start threads
for thr in threads:
    thr.start()
# Wait to finish their jobs
for thr in threads:
    thr.join()
xcom_push(ret)
Task 3: Do more stuff coming from previous task, and so on
We have built a system that queries MongoDB for a list and generates a python file per item containing one DAG (note: having each DAG in its own python file helps Airflow scheduler efficiency, with its current design) - the generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.

Airflow Cluster Policy not taking effect

I'm attempting to use a Cluster Policy in Airflow 1.9. I followed the instructions in the official documentation, but it doesn't seem to be taking effect.
In my file at $AIRFLOW_HOME/config/airflow_local_settings.py, I've defined the method as the docs instructed and it has the following signature:
def policy(task_instance):
Additional concerns:
What Airflow component is actually running the policy code (is it the scheduler)?
Is there a recommended way to unit test cluster policy code? If not, what about local testing?
Can anyone help me understand why this Cluster Policy isn't taking effect?
I'm using Airflow 1.9.
So you seem to have the file in the right place according to the documents: https://github.com/apache/airflow/blob/master/docs/concepts.rst#where-to-put-airflow_local_settingspy
And your signature is right: https://airflow.apache.org/docs/stable/concepts.html#mutate-tasks-after-dag-loaded
But you haven't shown what you did and how that "did not work".
I believe the def policy(task): signature is run on the scheduler after DAG parsing (as the docs seem to say) while the def task_instance_mutation_hook(ti): signature is run by the task executor on the worker. That's probably why you're not seeing some changes take.
E.g. timeout or queue is something the scheduler enforces, but connection ID is something the worker needs to know during execution.
So if what you wanted to work was a timeout policy, it should have, but if what you wanted to work was a connection ID enforcement, that wouldn't have.
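For illustration, a scheduler-enforced policy such as a default timeout could be sketched like this in $AIRFLOW_HOME/config/airflow_local_settings.py (a sketch against the 1.x policy(task) hook, not a drop-in fix for the question):
# $AIRFLOW_HOME/config/airflow_local_settings.py
from datetime import timedelta


def policy(task):
    # 'task' is the operator instance the scheduler sees after DAG parsing.
    # Give every task a default execution timeout if the DAG author set none.
    if task.execution_timeout is None:
        task.execution_timeout = timedelta(hours=2)
As for unit testing, since this is a plain function, it can be imported directly and called with an operator instance constructed inside the test; no running scheduler is needed.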

Airflow 1.9.0 is queuing but not launching tasks

Airflow is randomly not running queued tasks; some tasks don't even get queued status. I keep seeing the below in the scheduler logs:
[2018-02-28 02:24:58,780] {jobs.py:1077} INFO - No tasks to consider for execution.
I do see tasks in database that either have no status or queued status but they never get started.
The airflow setup is running https://github.com/puckel/docker-airflow on ECS with Redis. There are 4 scheduler threads and 4 Celery worker tasks. The tasks that are not running show in the queued state (grey icon); when hovering over the task icon, operator is null and the task details say:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:- The scheduler is down or under heavy load
Metrics on scheduler do not show heavy load. The dag is very simple with 2 independent tasks only dependent on last run. There are also tasks in the same dag that are stuck with no status (white icon).
Interesting thing to notice is when I restart the scheduler tasks change to running state.
Airflow can be a bit tricky to setup.
Do you have the airflow scheduler running?
Do you have the airflow webserver running?
Have you checked that all DAGs you want to run are set to On in the web ui?
Do all the DAGs you want to run have a start date which is in the past?
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
If nothing else works, you can use the web ui to click on the dag, then on Graph View. Now select the first task and click on Task Instance. In the paragraph Task Instance Details you will see why a DAG is waiting or not running.
I've had for instance a DAG which was wrongly set to depends_on_past: True, which prevented the current instance from starting correctly.
Also a great resource directly in the docs, which has a few more hints: Why isn't my task getting scheduled?.
I'm running a fork of the puckel/docker-airflow repo as well, mostly on Airflow 1.8 for about a year with 10M+ task instances. I think the issue persists in 1.9, but I'm not positive.
For whatever reason, there seems to be a long-standing issue with the Airflow scheduler where performance degrades over time. I've reviewed the scheduler code, but I'm still unclear on what exactly happens differently on a fresh start to kick it back into scheduling normally. One major difference is that scheduled and queued task states are rebuilt.
Scheduler Basics in the Airflow wiki provides a concise reference on how the scheduler works and its various states.
Most people solve the scheduler diminishing throughput problem by restarting the scheduler regularly. I've found success at a 1-hour interval personally, but have seen as frequently as every 5-10 minutes used too. Your task volume, task duration, and parallelism settings are worth considering when experimenting with a restart interval.
For more info see:
Airflow: Tips, Tricks, and Pitfalls (section "The scheduler should be restarted frequently")
Bug 1286825 - Airflow scheduler stopped working silently
Airflow at WePay (section "Restart everything when deploying DAG changes.")
This used to be addressed by restarting every X runs using the SCHEDULER_RUNS config setting, although that setting was recently removed from the default systemd scripts.
You might also consider posting to the Airflow dev mailing list. I know this has been discussed there a few times and one of the core contributors may be able to provide additional context.
Related Questions
Airflow tasks get stuck at "queued" status and never gets running (especially see Bolke's answer here)
Jobs not executing via Airflow that runs celery with RabbitMQ
Make sure you don't have datetime.now() as your start_date
It's intuitive to think that if you tell your DAG to start "now" that it'll execute "now." BUT, that doesn't take into account how Airflow itself actually reads datetime.now().
For a DAG to be executed, the start_date must be a time in the past, otherwise Airflow will assume that it's not yet ready to execute. When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it's not ready to run. Since this will happen every time Airflow heartbeats (evaluates your DAG) every 5-10 seconds, it'll never run.
To properly trigger your DAG to run, make sure to insert a fixed time in the past (e.g. datetime(2019,1,1)) and set catchup=False (unless you're looking to run a backfill).
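For example (the DAG id and interval are arbitrary):
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="my_dag",
    start_date=datetime(2019, 1, 1),  # a fixed time in the past, not datetime.now()
    schedule_interval="@daily",
    catchup=False,                    # don't backfill every interval since 2019
)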
By design, an Airflow DAG will execute at the completion of its schedule_interval
That means one schedule_interval AFTER the start date. An hourly DAG, for example, will execute its 2pm run when the clock strikes 3pm. The reasoning here is that Airflow can't ensure that all data corresponding to the 2pm interval is present until the end of that hourly interval.
This is a peculiar aspect to Airflow, but an important one to remember - especially if you're using default variables and macros.
Time in Airflow is in UTC by default
This shouldn't come as a surprise given that the rest of your databases and APIs most likely also adhere to this format, but it's worth clarifying.
Full article and source here
I also had a similar issue, but it is mostly related to SubDagOperator with more than 3000 task instances in total (30 tasks * 44 subdag tasks).
What I found out is that the airflow scheduler is mainly responsible for putting your scheduled tasks into "Queued Slots" (pool), while the airflow celery workers are the ones who pick up your queued tasks, put them into the "Used Slots" (pool) and run them.
Based on your description, your scheduler should work fine. I suggest you check your celery workers' logs to see whether there are any errors, or restart them to see whether that helps. I experienced some issues where celery workers would go on strike for a few minutes and then start working again (especially with SubDagOperator).
One of the very silly reasons could be that the DAG is "paused", which is the default state the first time. I lost around 2 hours fighting it. If you are using the Airflow web interface, this shows up as a toggle next to your DAG in the list.
I faced the issue today and found that bullet point 4 from tobi6's answer above worked out and resolved the issue:
*'Do all the DAGs you want to run have a start date which is in the past?'*
I am using airflow version v1.10.3
My problem was one step further: in addition to my tasks being queued, I couldn't see any of my celery workers in the Flower UI. The solution was that, since I was running my celery worker as root, I had to make changes in my ~/.bashrc file.
The following steps made it work:
Add export C_FORCE_ROOT=true to your ~/.bashrc file
source ~/.bashrc
Run worker : nohup airflow worker $* >> ~/airflow/logs/worker.logs &
Check your Flower UI at http://{HOST}:5555
I think it's worth mentioning that there's an open issue that can cause tasks to fail to run with no obvious reason: https://issues.apache.org/jira/browse/AIRFLOW-5506
The problem seems to occur when using LocalScheduler connected to a PostgreSQL airflow db, and results in the scheduler logging a number of "Killing PID xxxx" lines. Check the scheduler logs after the DAGs have been stalled without starting any new tasks for a while.
You can try to stop the webserver and the scheduler:
ps -ef | grep airflow #show the process id
kill 1234 #kill the webserver
kill 5678 #kill the scheduler
Remove the files from the airflow folder if they exist (they will be created again):
airflow-scheduler.err
airflow-scheduler.pid
airflow-webserver.err
airflow-webserver.pid
Start the webserver and the scheduler again.
airflow webserver -D
airflow scheduler -D
-D will make the services run in the background.
I had a similar issue of a triggered DAG "running" indefinitely because its first task stuck in "queued" state.
I realized this was because of a "ghost" DAG that had actually changed name. It seems that since the DAG had run in the past (had data in the Postgres DB) and was referenced as a child DAG in other DAGs, triggering the parent DAGs referencing the old name would "resurrect" the old DAG name, but with the new code. Indeed the old DAG name and new DAG code did not match, thus producing an "infinitely queued execution" bug.
Solution:
Delete all the previous DAG runs with the old name
Restart everything (webserver, worker, executor,...) OR Delete relevant DAGs (with the "delete DAG" button in the UI).
The interpretation of the bug can vary but this fix worked in my case.
One more thing to check is whether the concurrency parameter of your DAG has been reached.
I'd experienced the same situation when some tasks were shown as NO STATUS.
It turned out that my File_Sensor tasks were run with a timeout set to 1 week, while the DAG timeout was only 5 hours. That led to a case where, whenever the files were missing, many sensor tasks were running at the same time, which overloaded the concurrency!
The dependent tasks couldn't be started before the sensor task succeeded; when the DAG timed out, they got NO STATUS.
My solution:
Carefully set tasks and DAG timeout
Increase dag_concurrency in airflow.cfg file in AIRFLOW_HOME folder.
Please refer to the docs.
https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
I believe this is an issue with celery version 4.2.1 and redis 3.0.1 as described here:
https://github.com/celery/celery/issues/3808
we resolved the issue by downgrading our redis version 2.10.6:
redis==2.10.6
In my case, tasks were not being launched because I had a pool configured for all operators but hadn't created it; hence, tasks were not even scheduled. An operator looks like:
foo = DummyOperator(
    task_id='foo',
    dag=dag,
    pool='capser'
)
To create a pool, go to Admin > Pools > Create and set the slots, for example 128, which runs successfully for me. You can also configure it by using the CLI.
Counter-intuitive UI message!
I have spent days on this. So want to elaborate on my specific issue (s).
Each DAG has a state. By default the state can be 'paused' or 'not paused'.
The first confusion arises from: what is the default state on startup? The attached UI message seems to indicate that the state is 'not paused' and that clicking the toggle pauses it.
In reality, the default state is 'paused'. This state can be controlled by settings, environment variables, parameters and the UI. I have detailed them below.
The second confusion arises because of the UI again. When we manually trigger a DAG which is in the paused state, the UI shows the DAG as running (green circle)! But the DAG is actually still paused. The tasks will not execute unless it is un-paused.
If we read the task instance details, the message would be:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
What is the 'None' state!? And clear which task?!
The actual problem is that the DAG is in the paused state. On toggling the DAG state, the tasks start to execute.
The paused state of the DAG can be changed by:
clicking the button on the UI.
setting your particular DAG to be un-paused on creation, by adding the below parameter to your DAG
DAG(dag_id='your-dag', is_paused_upon_creation=False)
setting the config variable in the airflow.cfg file (caution: this will start all your dags, including the example ones)
dags_are_paused_at_creation = False
configuring an environment variable before starting up the scheduler/webserver (caution: this will start all your dags, including the example ones)
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
Make sure that your task is assigned to the same queue that your workers are listening to. This means that in your DAG file you have to set 'queue': 'queue_name', and in your worker configuration you have to set either default_queue = 'queue_name' in airflow.cfg or AIRFLOW__OPERATORS__DEFAULT_QUEUE: 'queue_name' in the docker-compose.yaml (in case you're using Docker).
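A minimal illustration of the DAG side of that (the operator type, command, and names are arbitrary, and the dag object is assumed to exist):
from airflow.operators.bash_operator import BashOperator

# The worker that should run this task must listen on the same queue,
# e.g. started with: airflow worker -q queue_name
run_job = BashOperator(
    task_id="run_job",
    bash_command="echo hello",
    queue="queue_name",
    dag=dag,
)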
