Airflow SubDagOperator deadlock

I'm running into a problem where a DAG composed of several SubDagOperators hangs indefinitely.
The setup:
Using CeleryExecutor. For the purposes of this example, let's say we have one worker which can run five tasks concurrently.
The DAG I'm running into problems with runs several SubDagOperators in parallel; picture a fan-out graph where every node is a SubDagOperator.
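For concreteness, a stripped-down sketch of the layout looks something like this (the names, dates, and subdag body here are placeholders, not the real DAG; the real subdags each contain several tasks):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.subdag_operator import SubDagOperator

    default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}


    def make_section(parent_dag_id, child_id):
        # Stand-in for a real subdag; the actual ones contain several tasks each.
        subdag = DAG(
            dag_id="{}.{}".format(parent_dag_id, child_id),
            default_args=default_args,
            schedule_interval="@daily",
        )
        DummyOperator(task_id="do_work", dag=subdag)
        return subdag


    with DAG("parent_dag", default_args=default_args,
             schedule_interval="@daily") as dag:
        start = DummyOperator(task_id="start")
        # Fan out into more parallel SubDagOperators than the worker has slots.
        for i in range(6):
            section = SubDagOperator(
                task_id="section_{}".format(i),
                subdag=make_section("parent_dag", "section_{}".format(i)),
            )
            start >> section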
The problem: the DAG stops making progress once it reaches the high-parallelism part. The root cause seems to be that the top-level SubDagOperators take up all five of the slots available for running tasks, so none of the subtasks inside those SubDagOperators can run. The subtasks get stuck in the queued state and nothing makes progress.
It was a bit surprising to me that the SubDagOperator tasks would compete with their own subtasks for task running slots, but it makes sense to me now. Are there best practices around writing SubDagOperators that I am violating?
My plan is to work around this by creating a custom operator that encapsulates the tasks currently wrapped inside the SubDagOperators. Does anyone have advice on whether it is advisable to create an operator composed of other operators?

It does seem like SubDagOperator should be avoided because it causes this deadlock issue. I ultimately found that for my use case, I was best served by writing my own custom BaseOperator subclass to do the tasks I was doing inside SubDagOperator. Writing the operator class was much easier than I expected.
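For anyone curious, the shape of the operator was roughly the following (a minimal sketch; the class name, the steps parameter, and the callable steps are made up for illustration, not my actual code):

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults


    class SectionOperator(BaseOperator):
        """Runs, inside a single task slot, the work that used to be a subdag."""

        @apply_defaults
        def __init__(self, steps, *args, **kwargs):
            super(SectionOperator, self).__init__(*args, **kwargs)
            self.steps = steps

        def execute(self, context):
            # Run the steps sequentially instead of spreading them across
            # subdag tasks that compete with their parent for worker slots.
            for step in self.steps:
                self.log.info("Running step %s", step)
                step()  # each step is a plain callable in this sketch

Because the whole section now runs as a single task, it only ever occupies one of the five worker slots.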

Related

airflow create sub process based on number of files

A newbie question in Airflow:
I have a list of 100+ servers in a text file. Currently, a Python script is used to log in to each server, read a file, and write the output, and it takes a long time to finish. If this job is converted to an Airflow DAG, is it possible to split the servers into multiple groups and start a new task for each group using some operator? Or can this only be achieved by modifying the Python script (e.g. using async) and executing it with the PythonOperator? Seeking advice/best practice. I tried searching for examples but was not able to find one. Thanks!
Airflow is not really a "map-reduce" type of framework (which is what you seem to be trying to implement). Airflow tasks are not (at least currently) designed to split work between them, and it is very atypical in Airflow to have N tasks that each do the same thing on a different subset of the data. Airflow is more for orchestrating the logic: each task conceptually does a different thing, and more often than not Airflow "tasks" do not do the job themselves; they rather tell other systems what to do and wait until that work gets done.
Typically Airflow is used to orchestrate services that excel at this kind of processing: for example, a Hadoop job could handle the "parallel", map-reduce style work. You could also, as you mentioned, use async, multi-threading, or even multi-processing inside a Python operator, but at some scale dedicated tools are usually easier to use and better at getting the most value out of the work (with efficient utilization of parallelism, for example).
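That said, if you do want to try the split-into-groups approach directly in Airflow, a minimal sketch could look like the following (the file path, the group count, and the body of process_group are assumptions for illustration, and import paths can differ slightly between Airflow versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    NUM_GROUPS = 10
    SERVER_LIST = "/path/to/servers.txt"  # hypothetical location of the server list


    def process_group(group_index):
        with open(SERVER_LIST) as f:
            servers = [line.strip() for line in f if line.strip()]
        # Every NUM_GROUPS-th server belongs to this group.
        for server in servers[group_index::NUM_GROUPS]:
            # Log in to the server, read the file, and write the output here.
            pass


    with DAG("collect_server_files", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        for i in range(NUM_GROUPS):
            PythonOperator(
                task_id="process_group_{}".format(i),
                python_callable=process_group,
                op_kwargs={"group_index": i},
            )

Whether those groups actually run at the same time still depends on your executor and parallelism settings, which is part of why dedicated processing tools tend to win at larger scale.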

Deadlock with ExternalTaskSensor: why, how often, and how to get around the problem?

I am implementing scheduled pipelines, and currently I am using an ExternalTaskSensor to set an inter-DAG dependency. I read here that if you don't manually raise the priority of the upstream tasks, it's possible that there will be a deadlock.
I was wondering how common this situation is, how you manually raise the priority of different tasks (the source code of many operators, like the Bash and Python operators, doesn't seem to expose a priority_weight param directly), and whether there are any better methods of setting inter-DAG dependencies.
Thanks
I've never used ExternalTaskSensor in production, so I can't comment on how often deadlocks occur. But apart from the priority_weight / weight_rule that you already mentioned, I can think of two more ways to try to overcome this:
Using Airflow pools to guarantee dedicated slots for a set of tasks
Using the mode param of sensors (BaseSensorOperator); a sketch combining both is below
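A minimal sketch combining the two (the DAG ids, pool name, and schedule are made up, and the ExternalTaskSensor import path varies between Airflow versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.sensors.external_task_sensor import ExternalTaskSensor

    with DAG("downstream_dag", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        wait_for_upstream = ExternalTaskSensor(
            task_id="wait_for_upstream",
            external_dag_id="upstream_dag",
            external_task_id="final_task",
            mode="reschedule",   # give up the worker slot between pokes
            pool="sensor_pool",  # a dedicated pool created beforehand in the UI/CLI
            poke_interval=300,
            timeout=6 * 60 * 60,
        )

With mode="reschedule" the sensor frees its slot between pokes, and the pool caps how many sensors can hold slots at once, which keeps them from starving the tasks they are waiting on.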

activiti taskService complete fails when executed concurrently

Hi, I am facing a strange situation where I am trying to mark a set of tasks as complete, all concurrently.
The first one goes through, and the second one sometimes (rarely) goes through, but mostly it does not.
When I do these individually they work.
I suspect it has something to do with database locking. Is there a workaround or code pattern for executing task and variable updates concurrently?
Do they belong to the same process instance?
And yes, there will be a db locking mechanism in place, because when you complete each task a process instance will need to move forward.
Can you please clarify what you are trying to solve? What is your business scenario?
Cheers
Activiti uses pre-emptive locking and this can cause problems for parallel tasks.
Typically if you use the "exclusive" flag the problems go away (https://www.activiti.org/userguide/#exclusiveJobs).
Keep in mind that jobs never actually run in parallel, the job engine selects jobs to run and if there are multiple they will be run sequentially (which appears to be parallel to the user).

Race Condition in xv6

I am a newbie in the field of operating systems and am trying to learn by hacking on xv6. My question is: can we decide, before making the call to fork, whether to run the parent or the child first, using a system call? That is, can I have a function pass an argument to kernel space that decides whether the parent or the child runs first? The argument could be:
1 - parent
0 - child
I think the problem is that fork() just creates a copy of the process and makes it runnable, but the module responsible for allowing it to run is the scheduler. Therefore, the parameter you mentioned should also provide this information to the scheduler in some way.
If you manage to do that, I think you can enqueue the two processes in the order you prefer in the runnable queue and let the scheduler pick the first runnable process.
However, you cannot control for how long the first process will run. In fact, at the next scheduling event another process might be allowed to run and the previous one would be suspended.

what are threads in actionscript functions?

I've seen a lot of other developers refer to threads in ActionScript functions. As a newbie I have no idea what they are referring to so:
What is a thread in this sense?
How would I run more than one thread at a time?
How do I ensure that I am only running one thread at a time?
Thanks
~mike
Threads represent a way to have a program appear to perform several jobs concurrently. Whether or not the jobs actually occur simultaneously depends on several factors (most importantly, whether the CPU the program is running on has multiple cores available to do the work). Threads are useful because they allow work to be done in one context without interfering with another context.
An example will help to illustrate why this is important. Suppose that you have a program which fetches the list of everyone in the phone book whose name matches some string. When people click the "search" button, it will trigger a costly and time-consuming search, which might not complete for a few seconds.
If you have only a single-threaded execution model, the UI will hang and be unresponsive until the search completes. Your program has no choice but to wait for the results to finish.
But if you have several threads, you can offload the search operation to a different thread, and then have a callback -- a trigger which is invoked when the work is completed -- to let you know that things are ready. This frees up the UI and allows it to continue to respond to events.
Unfortunately, because ActionScript's execution model doesn't support threads natively, it's not possible to get true threading. There is a rough approximation called "green threads", which are threads that are controlled by an execution context or virtual machine rather than a larger operating system, which is how it's usually done. Several people have taken a stab at it, although I can't say how widespread their usage is. You can read more at Alex Harui's blog here and see an example of green threads for ActionScript here.
It really depends on what you mean. The execution model for ActionScript is single-threaded, meaning it can not run a process in the background.
If you are not familiar with threading, it is essentially the ability to have something executed in the background of a main process.
So, if you needed to do a huge mathematical computation in your flex/flash project, with a multi-threaded program you could do that in the background while you simultaneously updated your UI. Because ActionScript is not multi-threaded you can not do such things. However, you can create a pseudo-threading class as demonstrated here:
http://blogs.adobe.com/aharui/pseudothread/PseudoThread.as
The others have described what threading is, and you'd need threading if you were getting hardcore into C++ and 3D game engines, among many other computationally-expensive operations, and languages that support multi-threading.
Actionscript doesn't have multi-threading. It executes all code in one frame. So if you create a for loop that processes 100,000,000 items, it will cause the app to freeze. That's because the Flash Player can only execute one thread of code at a time, per frame.
You can achieve pseudo-threading by using:
Timers
Event.ENTER_FRAME
Those allow you to jump around and execute code.
Tween engines like TweenMax can operate on 1000's of objects at once over a few seconds by using Timers. You can also do this with Event.ENTER_FRAME. There is something called "chunking" (check out Grant Skinner's AS3 Optimizations Presentation), which says "execute computationally expensive tasks over a few frames", like drawing complex bitmaps, which is a pseudo-multi-threading thing you can do with actionscript.
A lot of other things are asynchronous, like service calls. If you make an HTTPService request in Flex, it will send a request to the server and then continue executing code in that frame. Once it's done, the server can still be processing that request (say it's saving a 30mb video to a database on the server), and that might take a minute. Then it will send something back to Flex and you can continue code execution with a ResultEvent.RESULT event handler.
So Actionscript basically uses:
Asynchronous events, and
Timers...
... to achieve pseudo-multi-threading.
A thread allows you to execute two or more blocks of ActionScript simultaneously. By default you will always be executing on the same default thread unless you explicitly start a new thread.
