Airflow Dagrun for each datum instead of scheduled

The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf (see the sketch after this question). The problem with this is that it is not the intended way for Airflow to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 DagRuns per day.
Run a DAG every period for processing all documents created in the collection for that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if a document takes longer than other documents do to be processed by a task, those other documents are waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
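For concreteness, a minimal sketch of what option 1 would look like, assuming Airflow 2.x (the DAG id, conf key, and callable are illustrative): the per-document DAG reads the document ID out of dag_run.conf.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_document(**context):
    # Whatever was passed via `airflow dags trigger --conf '{"document_id": "..."}'`
    # is available on the DagRun object.
    doc_id = context["dag_run"].conf["document_id"]
    print(f"Processing document {doc_id}")


with DAG(
    dag_id="process_one_document",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # never scheduled; only triggered externally
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process_document)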

From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow is the ease of parametrization. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
import time
from threading import Thread

from prefect import task, Flow, Parameter


def stream():
    for x in range(10):
        yield x
        time.sleep(1)


@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x


@task
def transform(x):
    return x * 2


@task
def load(y):
    print("Received y: {}".format(y))


with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)

for x in stream():
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()

You could change the trigger_rule from "all_success" to "all_done":
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
You could also create a branch with trigger_rule set to "one_failed" that handles the failed documents differently (e.g. moves them to a "failed" folder and sends a notification).
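A hedged sketch of what that could look like, assuming Airflow 2.x (the DAG, task ids, and callables are illustrative):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def process_fn():
    ...  # process the batch of documents


def downstream_fn():
    ...  # continue with whichever documents succeeded


def quarantine_fn():
    ...  # e.g. move failed documents to a "failed" folder and notify


with DAG("docs_batch", start_date=datetime(2021, 1, 1), schedule_interval="@hourly", catchup=False) as dag:
    process = PythonOperator(task_id="process_documents", python_callable=process_fn)

    # Runs even if `process` failed, so the remaining documents keep moving.
    continue_pipeline = PythonOperator(
        task_id="continue_pipeline",
        python_callable=downstream_fn,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    # Fires only when something upstream failed.
    handle_failures = PythonOperator(
        task_id="handle_failed_documents",
        python_callable=quarantine_fn,
        trigger_rule=TriggerRule.ONE_FAILED,
    )

    process >> [continue_pipeline, handle_failures]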

I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Choking points might be:
executor - use the Celery Executor instead of the Local Executor, for example
backend database - monitor and tune as necessary (indexes, proper storage, etc.)
webserver - for thousands of dagruns and tasks, perhaps only use the webserver for dev/qa environments and not for production, where you have a higher rate of task/dagrun submissions; you could use the CLI instead
Another approach is scaling out by running multiple Airflow instances - partition the documents into, say, ten buckets, and assign each partition's documents to just one Airflow instance.
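As a rough illustration of the partitioning idea (purely a sketch; the hash choice and the mapping to instances are assumptions):
import hashlib

NUM_PARTITIONS = 10  # one partition per Airflow instance


def partition_for(document_id: str) -> int:
    """Deterministically map a document ID to one of the partitions."""
    digest = hashlib.md5(document_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


# Documents in partition 3 would only ever be picked up by Airflow instance #3.
print(partition_for("5f1c7c2e9b1e8a0012345678"))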

I'd process the heavier tasks in parallel and feed successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish before moving downstream, but this would still be far more acceptable than spawning one DAG per record. Something along these lines:
Task 1: read Mongo, filtering by some timestamp (remember idempotence), and feed the downstream tasks (e.g. via XCom); see the sketch at the end of this answer
Task 2: do the work in parallel via a PythonOperator, or even better via a K8s pod, e.g.:
import threading
from queue import Queue

def thread_fun(ret):
    # Each worker drains jobs from the shared queue until it is empty
    while not job_queue.empty():
        job = job_queue.get()
        try:
            ret.append(stuff_done(job))
        except Exception:
            pass  # a single failed document must not block the others
        job_queue.task_done()
    return ret

# Create workers and queue
threads = []
ret = []  # a mutable object collecting results from all workers
job_queue = Queue(maxsize=0)
for thr_nr in range(appropriate_thread_nr):
    worker = threading.Thread(
        target=thread_fun,
        args=(ret,)
    )
    worker.daemon = True
    threads.append(worker)
# Populate queue with jobs pulled from the upstream task
for row in xcom_pull(task_ids=upstream_task):
    job_queue.put(row)
# Start threads
for thr in threads:
    thr.start()
# Wait for them to finish their jobs
for thr in threads:
    thr.join()
xcom_push(ret)
Task 3: Do more stuff coming from previous task, and so on
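A hedged sketch of what Task 1 could look like (the connection string, database/collection names, and the created_at field are assumptions), using the classic execution_date context keys so the same schedule window always yields the same documents:
from pymongo import MongoClient


def fetch_document_ids(**context):
    # The schedule window makes this idempotent: re-running the same DagRun
    # always yields the same set of documents.
    start = context["execution_date"]
    end = context["next_execution_date"]

    coll = MongoClient("mongodb://localhost:27017")["mydb"]["documents"]
    ids = [
        str(doc["_id"])
        for doc in coll.find({"created_at": {"$gte": start, "$lt": end}}, {"_id": 1})
    ]
    return ids  # the return value is pushed to XCom for Task 2 to pull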

We have built a system that queries MongoDB for a list and generates a Python file per item, each containing one DAG (note: having each DAG in its own Python file helps Airflow scheduler efficiency, with its current design). The generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.
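A rough sketch of how such a generator task might write one DAG file per item (the Mongo query, output directory, and template are assumptions, not the actual system described above):
from pathlib import Path

from pymongo import MongoClient

DAGS_DIR = Path("/opt/airflow/dags/generated")

TEMPLATE = """\
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def process():
    print("processing item {item_id}")

with DAG(
    dag_id="process_{item_id}",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process)
"""


def generate_dag_files():
    # One .py file per item keeps the scheduler's per-file parsing cheap.
    items = MongoClient("mongodb://localhost:27017")["mydb"]["items"].find({}, {"_id": 1})
    for item in items:
        item_id = str(item["_id"])
        (DAGS_DIR / f"process_{item_id}.py").write_text(TEMPLATE.format(item_id=item_id))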

Related

New task in DAG blocks further DAG executions

We have an ETL DAG which is executed daily. DAG and tasks have the following parameters:
catchup=False
max_active_runs=1
depends_on_past=True
When we add a new task, due to the depends_on_past property, no new DAG runs get scheduled, as all previous states for the new task are missing.
We would like to avoid having to run a manual backfill or manually marking previous runs from the UI, as it can be easily forgotten, and we also have some dynamic DAGs where tasks get added automatically and halt future DAG executions.
Is there a way to automatically set past executions for new tasks as skipped by default, or some other approach that will allow future DAG runs to execute without human intervention?
We also considered creating a maintenance DAG that would insert missing task executions with skipped state, but would rather not go this route.
Are we missing something as the flow looks like a common thing to do?
Defined in Airflow documentation on BaseOperator:
depends_on_past (bool) – when set to true, task instances will run
sequentially and only if the previous instance has succeeded or has
been skipped. The task instance for the start_date is allowed to run.
As long as there exists a previous instance of the task, if that previous instance is not in the success state, the current instance of the task cannot run.
When adding a task to a DAG with existing DagRuns, Airflow will create the missing task instances in the None state for all DagRuns. Unfortunately, it is not possible to set the default state of task instances.
I do not believe there is a way to allow future task instances of a DAG with existing dagruns to run without human intervention. Personally, for depends_on_past enabled tasks, I will mark the previous task instance as success either through the CLI or the Airflow UI.
Looks like there is a GitHub issue describing exactly what you are experiencing! Feel free to bump this PR or take a stab at it if you would like.
A hacky solution is to set depends_on_past to False, as max_active_runs=1 will implicitly guarantee the same behavior. As of the current Airflow version, the scheduler orders both DAG runs and task instances by execution date before running them (checked on 1.10.x and also 2.0).
Another difference is that the next execution will be scheduled even if the previous one fails. We solved this by retrying an unlimited number of times (setting a very large retry count), and alerting if the retry number grows larger than some value.
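Put together, the workaround from the last two answers might look like this (a sketch only; the DAG id and retry numbers are illustrative):
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "depends_on_past": False,          # rely on max_active_runs=1 instead
    "retries": 10_000,                 # "effectively unlimited" retries
    "retry_delay": timedelta(minutes=10),
}

dag = DAG(
    dag_id="etl_daily",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=1,                 # serializes DAG runs
)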

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
           |-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
    for filepath in Path(WATCH_DIR).glob("*"):
        sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either the Move or TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore.
This gives rise to two problems:
After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows the tasks' runs and states, but clicking on a "square" and picking any details leads to an Airflow error.)
The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?
While it isn't clear, I'm assuming that the downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and stay there always.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with a slight modification to your DAGs' structure, you can get rid of your troubles (which are also quite advanced, IMO).
Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
Make the upstream orchestrator DAG do two things
Sense / wait for files to appear
For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it either way:
have a single task that does the file sensing + triggering of the downstream DAGs for each file
have two tasks (I'd prefer this):
the first task senses the files and, when they appear, publishes their list in an XCom
the second task reads that XCom and, for each file, triggers its corresponding DAG (a sketch follows below)
but whichever way you choose, you'll have to replicate the relevant bits of code from
FileSensor (to be able to sense files and then publish their names in an XCom) and
TriggerDagRunOperator (so as to be able to trigger multiple DAGs with a single task)
here's a diagram depicting the two-task approach
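A hedged sketch of those two tasks (the directory path, DAG-id naming, and the trigger helper's import path are assumptions; the trigger_dag helper lives under airflow.api.common.experimental in 1.10.x and has moved in later releases):
from pathlib import Path

# Airflow 1.10.x location; the module path differs in newer releases.
from airflow.api.common.experimental.trigger_dag import trigger_dag

WATCH_DIR = "/data/incoming"


def sense_files(**context):
    """Task 1: publish the list of files found in the watch directory to XCom."""
    # (A real implementation would poke/loop like FileSensor until files appear.)
    return [str(p) for p in Path(WATCH_DIR).glob("*")]


def trigger_per_file(**context):
    """Task 2: for each file from XCom, trigger its per-file processing DAG."""
    files = context["ti"].xcom_pull(task_ids="sense_files")
    for f in files:
        trigger_dag(
            dag_id="process_file",                        # the static downstream DAG
            run_id=f"process_{Path(f).name}_{context['ds']}",
            conf={"filepath": f},
        )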
The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible". The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does show up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

Airflow: Only allow one instance of task

Is there a way to specify that a task can only run once concurrently? So in the tree above, where DAG concurrency is 4, Airflow would start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no order between the tasks. These tasks are independent but related in purpose, and are therefore kept in one DAG so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks and only start a task in the next run if the same task in the previous run is done.
I may have misunderstood your question, but I believe you are wanting to have all the tasks in a single dagrun finish before the tasks begin in the next dagrun. So a DAG will only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dotpoint): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
Airflow operator documentation describes argument task_concurrency. Just set it to one.
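For example (illustrative operator and DAG; note that in newer Airflow versions the argument has been renamed to max_active_tis_per_dag):
from airflow.operators.python import PythonOperator

task_2 = PythonOperator(
    task_id="task_2",
    python_callable=do_task_2,
    task_concurrency=1,   # at most one running instance of this task across DAG runs
    dag=dag,
)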
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So the future DAGs will wait for the previous ones to finish successfully before executing.
In airflow.cfg under [core] you will find:
dag_concurrency = 16
# The number of task instances allowed to run concurrently by the scheduler
You're free to change this to whatever you desire.

Determining if a DAG is executing

I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that polls an SFTP site to find new files. If any are found, then I create custom task IDs for the dynamically created tasks and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
for file_path in directory_list:
    ...  # SFTP code that GETs the remote files
That part works fine. It seems both the airflow webserver and airflow scheduler are iterating through all the DAGs once a second and actually running the code that retrieves the directory_list. This means I'm hitting the SFTP site ~2 x a second to authenticate and pull a list of files. I'd like to have some conditional code that only executes if the DAG is actually being run.
When an SFTP site uses password authentication, the # of times I connect really isn't an issue. One site requires key authentication and if there are too many authentication failures in a short timespan, the account is locked. During my testing, this seems to happen occasionally for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or executing manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?
You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG is executed with each scheduler loop. Or put another way, every time the scheduler loop processes the files in your DAG directory it is interpreting all the code in your DAG files. Anything not in a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler as well as any external systems you are making calls to.
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs well through the UI. This is mostly the result of Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exists in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
Adding a task to a DAG will make it appear (in the UI) as if all prior DAG runs simply never ran that task. Those task instances will have a None state, and the DAG run will be set to success or failed depending on the outcome of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks you will lose the ability to review history of the DAG. For example, if you run a DAG with task_x in the first 20 DAG runs but remove it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when the DAG runs are idempotent. This means that re-running any DAG run should have the same effect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding and removing tasks from previous DAG runs, so the results of re-running are not the same.
Solution Options
You have at least two options moving forward
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if not using a distributed executor) or to an Airflow Variable (this will result in more reads of the Airflow DB), and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files so that every file that exists is processed within a single task run (see the sketch at the end of this answer). This will make the DAGs idempotent and you will maintain accurate history through the logs.
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.
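A rough illustration of option 2: do the SFTP listing inside a task rather than at module top level, and process every file within that single task run. sftp_handler and SFTPToS3Operation mirror the question's code; everything else (import path for Airflow 1.9, task id, dag) is an assumption.
from airflow.operators.python_operator import PythonOperator


def fetch_all_sftp_files(**context):
    # The listing now happens only when the task actually runs, not every
    # time the scheduler parses the DAG file.
    directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
    for file_path in directory_list:
        ...  # GET / process / DELETE each remote file


fetch_task = PythonOperator(
    task_id="fetch_all_sftp_files",
    python_callable=fetch_all_sftp_files,
    provide_context=True,   # needed on Airflow 1.x for **context
    dag=dag,
)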

flask celery get task real time task status [duplicate]

How does one check whether a task is running in celery (specifically, I'm using celery-django)?
I've read the documentation, and I've googled, but I can't see a call like:
my_example_task.state() == RUNNING
My use-case is that I have an external (java) service for transcoding. When I send a document to be transcoded, I want to check if the task that runs that service is running, and if not, to (re)start it.
I'm using the current stable versions - 2.4, I believe.
Return the task_id (which is given by .delay()) and ask the Celery instance afterwards about the state:
x = method.delay(1, 2)
print(x.task_id)
When asking, get a new AsyncResult using this task_id:
from celery.result import AsyncResult
res = AsyncResult("your-task-id")
res.ready()
Creating an AsyncResult object from the task id is the way recommended in the FAQ to obtain the task status when the only thing you have is the task id.
However, as of Celery 3.x, there are significant caveats that could bite people if they do not pay attention to them. It really depends on the specific use-case scenario.
By default, Celery does not record a "running" state.
In order for Celery to record that a task is running, you must set task_track_started to True. Here is a simple task that tests this:
@app.task(bind=True)
def test(self):
    print(self.AsyncResult(self.request.id).state)
When task_track_started is False, which is the default, the state shown is PENDING even though the task has started. If you set task_track_started to True, then the state will be STARTED.
The state PENDING means "I don't know."
An AsyncResult with the state PENDING does not mean anything more than that Celery does not know the status of the task. This could be because of any number of reasons.
For one thing, AsyncResult can be constructed with invalid task ids. Such "tasks" will be deemed pending by Celery:
>>> task.AsyncResult("invalid").status
'PENDING'
Ok, so nobody is going to feed obviously invalid ids to AsyncResult. Fair enough, but it also has the effect that AsyncResult will consider a task that has successfully run, but that Celery has forgotten about, as being PENDING. Again, in some use-case scenarios this can be a problem. Part of the issue hinges on how Celery is configured to keep the results of tasks, because it depends on the availability of the "tombstones" in the results backend. ("Tombstones" is the term used in the Celery documentation for the data chunks that record how the task ended.) Using AsyncResult won't work at all if task_ignore_result is True. A more vexing problem is that Celery expires the tombstones by default. The result_expires setting is set to 24 hours by default. So if you launch a task, record the id in long-term storage, and more than 24 hours later create an AsyncResult with it, the status will be PENDING.
All "real tasks" start in the PENDING state. So getting PENDING on a task could mean that the task was requested but never progressed further than this (for whatever reason). Or it could mean the task ran but Celery forgot its state.
Ouch! AsyncResult won't work for me. What else can I do?
I prefer to keep track of goals rather than keep track of the tasks themselves. I do keep some task information, but it is really secondary to keeping track of the goals. The goals are stored in storage independent of Celery. When a request needs to perform a computation that depends on some goal having been achieved, it checks whether the goal has already been achieved; if yes, then it uses this cached goal, otherwise it starts the task that will effect the goal, and sends the client that made the HTTP request a response indicating it should wait for a result.
The variable names and hyperlinks above are for Celery 4.x. In 3.x the corresponding variables and hyperlinks are: CELERY_TRACK_STARTED, CELERY_IGNORE_RESULT, CELERY_TASK_RESULT_EXPIRES.
Every Task object has a .request property, which contains its request object. Accordingly, the following line gives the state of a Task task:
task.AsyncResult(task.request.id).state
You can also create custom states and update their value during task execution.
This example is from docs:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
http://celery.readthedocs.org/en/latest/userguide/tasks.html#custom-states
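On the caller side, the custom state and its metadata can then be read back from the AsyncResult (a small sketch):
from celery.result import AsyncResult

res = AsyncResult(task_id)
if res.state == 'PROGRESS':
    print(res.info)   # {'current': ..., 'total': ...}
else:
    print(res.state)  # PENDING / STARTED / SUCCESS / FAILURE / ...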
Old question but I recently ran into this problem.
If you're trying to get the task_id you can do it like this:
import celery
from celery_app import add
from celery import uuid
task_id = uuid()
result = add.apply_async((2, 2), task_id=task_id)
Now you know exactly what the task_id is and can now use it to get the AsyncResult:
# grab the AsyncResult
result = celery.result.AsyncResult(task_id)

# print the task id
print(result.task_id)
# 09dad9cf-c9fa-4aee-933f-ff54dae39bdf

# print the AsyncResult's status
print(result.status)
# SUCCESS

# print the result returned
print(result.result)
# 4
Just use this API from celery FAQ
result = app.AsyncResult(task_id)
This works fine.
Answer of 2020:
#### tasks.py
@celery.task()
def mytask(arg1):
    print(arg1)

#### blueprint.py
@bp.route("/args/arg1=<arg1>")
def sleeper(arg1):
    process = mytask.apply_async(args=(arg1,))  # or mytask.delay(arg1)
    state = process.state
    return f"Thanks for your patience, your job {process.task_id} \
             is being processed. Status {state}"
Try:
task.AsyncResult(task.request.id).state
This will provide the Celery task status. If the Celery task is already in the FAILURE state, it will throw an exception:
raised unexpected: KeyError('exc_type',)
I found helpful information in the
Celery Project Workers Guide inspecting-workers
For my case, I am checking to see if Celery is running.
inspect_workers = task.app.control.inspect()
if inspect_workers.registered() is None:
    state = 'FAILURE'
else:
    state = str(task.state)
You can play with inspect to get your needs.
First, in your Celery app:
vi my_celery_apps/app1.py
app = Celery(worker_name)
Next, go to the task file and import app from your Celery app module:
vi tasks/task1.py
from my_celery_apps.app1 import app
task = app.AsyncResult(taskid)
try:
    if task.state.lower() != "success":
        return
except:
    """ do something """
res = method.delay()
print(f"id={res.id}, state={res.state}, status={res.status} ")
print(res.get())
For simple tasks, we can use http://flower.readthedocs.io/en/latest/screenshots.html and http://policystat.github.io/jobtastic/ to do the monitoring.
For complicated tasks, say a task which deals with a lot of other modules, we recommend manually recording the progress and messages on the specific task unit.
Apart from the above programmatic approaches, the task status can be easily seen using Flower.
Real-time monitoring using Celery Events.
Flower is a web based tool for monitoring and administrating Celery clusters.
Task progress and history
Ability to show task details (arguments, start time, runtime, and more)
Graphs and statistics
Official Document:
Flower - Celery monitoring tool
Installation:
$ pip install flower
Usage:
Launch Flower against your Celery app, then open http://localhost:5555
Update:
There is an issue with versioning: Flower (version 0.9.7) works only with Celery (version 4.4.7). Moreover, when you install Flower, it downgrades your higher version of Celery to 4.4.7, and this never works for registered tasks.
