Deadlock with ExternalTaskSensor: why, how often, and how to work around it?

I am implementing scheduled pipelines, and currently I am using an ExternalTaskSensor to set inter-DAG dependencies. I read here that if you don't manually raise the priority of the upstream tasks, a deadlock is possible.
I was wondering how common this situation is, how you manually raise the priority of different tasks (the source code of many operators, like the Bash and Python operators, doesn't seem to have a priority_level param), and whether there are better ways of setting inter-DAG dependencies.
Thanks

I've never used ExternalTaskSensor in production, so I can't comment on how often deadlocks occur. But apart from the priority_weight / weight_rule you already mentioned, I can think of two more ways to try to overcome this, sketched below:
Using Airflow pools to guarantee dedicated slots for sets of tasks
Using the mode param of sensors (BaseSensorOperator)
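
For illustration, here is a minimal sketch combining both ideas, assuming Airflow 2.x; the DAG ids, task ids, the pre-created sensor_pool, and the timeout are all illustrative values, not anything from the question:

from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(dag_id="downstream", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream",     # illustrative upstream DAG id
        external_task_id="final_task",  # illustrative upstream task id
        pool="sensor_pool",             # a pool created beforehand, reserved for sensors
        mode="reschedule",              # free the worker slot between pokes
        timeout=60 * 60,                # fail after an hour instead of waiting forever
    )

With mode="reschedule" the sensor releases its slot between checks, so it cannot starve the upstream tasks it is waiting for, and the dedicated pool adds a second layer of protection.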

Related

airflow create sub process based on number of files

A newbie question about Airflow:
I have a list of 100+ servers in a text file. Currently, a Python script is used to log in to each server, read a file, and write the output, and it takes a long time to finish. If this job is converted to an Airflow DAG, is it possible to split the servers into multiple groups and start a new task for each group using some operator? Or can this only be achieved by modifying the Python script (for example, using async) and executing it with the PythonOperator? Seeking advice/best practices. I tried searching for examples but was not able to find one. Thanks!
Airflow is not really a "map-reduce" type of framework (which is what you seem to be trying to implement). Airflow tasks are not (at least currently) designed to split work between them; it is atypical in Airflow to have N tasks that each do the same thing on a different subset of the data. Airflow is more for orchestrating logic: each task conceptually does a different thing, and there are rarely cases where N parallel tasks do the same thing on different subsets of data. More often than not, Airflow tasks do not "do" the job themselves; they tell other systems what to do and wait until it gets done.
Typically Airflow is used to orchestrate services that excel at this kind of processing - for example, a Hadoop job could handle such "parallel", map-reduce-style work. You could also, as you mentioned, use async, multi-threading, or even multi-processing inside a PythonOperator, but at some scale dedicated tools are usually easier to use and better at extracting the most value (with efficient utilization of parallelism, for example).
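
That said, if you do want to fan out inside Airflow, one common pattern is to generate one task per group of servers when the DAG is parsed. A minimal sketch under the assumption that the server list is readable at parse time; the file path, the group size of 10, and the process_servers helper are all illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_servers(servers):
    for server in servers:
        ...  # log in to the server, read the file, write the output

with DAG(dag_id="server_report", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    with open("/path/to/servers.txt") as f:  # illustrative path
        servers = [line.strip() for line in f if line.strip()]

    for i in range(0, len(servers), 10):  # one task per group of 10 servers
        PythonOperator(
            task_id=f"process_group_{i // 10}",
            python_callable=process_servers,
            op_args=[servers[i : i + 10]],
        )

Each group then runs as its own task, and the executor's parallelism settings decide how many groups run at once.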

activiti taskService complete fails when executed concurrently

Hi, I am facing a strange situation where I am trying to complete a set of tasks concurrently.
The first one goes through, and the second one sometimes (rarely) goes through, but mostly it doesn't.
When I complete them individually, they work.
I feel it is something to do with database locking. Is there some workaround, or code for executing task and variable updates concurrently?
Do they belong to the same process instance?
And yes, there will be a DB locking mechanism in place, because when you complete each task the process instance needs to move forward.
Can you please clarify what you are trying to solve? What is your business scenario?
Cheers
Activiti uses optimistic locking, and this can cause problems for parallel tasks.
Typically, if you use the "exclusive" flag the problems go away (https://www.activiti.org/userguide/#exclusiveJobs).
Keep in mind that jobs never actually run in parallel: the job engine selects jobs to run, and if there are multiple they are run sequentially (which appears parallel to the user).

Airflow SubDagOperator deadlock

I'm running into a problem where a DAG composed of several SubDagOperators hangs indefinitely.
The setup:
Using CeleryExecutor. For the purposes of this example, let's say we have one worker which can run five tasks concurrently.
The DAG I'm running into problems with runs several SubDagOperators in parallel. For illustration, consider a graph where every node is a SubDagOperator (the original diagram is not reproduced here).
The problem: the DAG stops making progress in the high-parallelism part of the DAG. The root cause seems to be that the top-level SubDagOperators take up all five of the slots available for running tasks, so none of the subtasks inside those SubDagOperators are able to run. The subtasks get stuck in the queued state, and nothing makes progress.
It was a bit surprising to me that the SubDagOperator tasks would compete with their own subtasks for task slots, but it makes sense to me now. Are there best practices for writing SubDagOperators that I am violating?
My plan is to work around this by creating a custom operator to encapsulate the tasks currently inside the SubDagOperators. Does anyone have advice on whether it is advisable to create an operator composed of other operators?
It does seem like SubDagOperator should be avoided, because it causes this deadlock issue. I ultimately found that, for my use case, I was best served by writing my own custom BaseOperator subclass to do the work I was doing inside the SubDagOperator. Writing the operator class was much easier than I expected.
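
For what it's worth, the skeleton of that approach can be quite small. A minimal sketch, where the class name, the partition argument, and the private step methods are all made up for illustration; the point is that the former sub-DAG steps now run sequentially inside one execute(), occupying a single worker slot:

from airflow.models import BaseOperator

class ProcessPartitionOperator(BaseOperator):
    """Runs what used to be several sub-DAG tasks as sequential steps."""

    def __init__(self, partition, **kwargs):
        super().__init__(**kwargs)
        self.partition = partition

    def execute(self, context):
        # Each call below was previously its own task inside the SubDagOperator.
        data = self._extract(self.partition)
        transformed = self._transform(data)
        self._load(transformed)

    def _extract(self, partition): ...
    def _transform(self, data): ...
    def _load(self, data): ...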

How do TaskTrackers inform the JobTracker about their state?

I have been reading about Apache Hadoop. In MapReduce, a task is any process, that is, a mapper or a reducer, and together the tasks make up a job.
There are two components, the JobTracker and the TaskTracker. A TaskTracker runs on each node and manages that node's mapper and reducer tasks.
The JobTracker is the one that manages all the TaskTrackers.
So far I understand all the concepts theoretically, and they are well explained in many blogs.
But I have one doubt: how does a TaskTracker inform the JobTracker that a given task failed? How do they communicate with each other? Do they use some other software for this, such as Apache Avro?
Please explain the internal mechanism of this to me.
Looking forward to your kind reply.
Avro has nothing to do with this. It is just a serialization framework, which folks usually use when they feel that Hadoop's built-in serialization is not helping them much. Otherwise it is just another member of the Hadoop ecosystem.
Coming to your original question: it is done through heartbeats, as #thiru_k has specified above. But along with the number of available slots, heartbeat signals contain some other info as well, like job status, resource usage, etc. Tasks which don't report their progress for a while are marked as hung or killed. I would suggest you go through this link; it'll answer all your questions.
The TaskTrackers send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also tell the JobTracker the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
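
Purely as an illustration (this is not Hadoop's actual code), the heartbeat pattern amounts to a worker periodically pushing a small status report to a coordinator; the function and field names here are invented:

import time

def heartbeat_loop(send_to_coordinator, get_local_status, interval=3.0):
    """Periodically report status so the coordinator knows this worker is alive."""
    while True:
        status = get_local_status()
        send_to_coordinator({
            "free_slots": status["free_slots"],    # where new tasks can be scheduled
            "task_states": status["task_states"],  # e.g. {"task_42": "FAILED"}
        })
        time.sleep(interval)

A failed task simply shows up in the next report, and a worker whose reports stop arriving is eventually declared dead by the coordinator.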

How commonly do deadlock issues occur in programming?

I've programmed in a number of languages, but I am not aware of any deadlocks in my code.
I took this to mean it doesn't happen.
Does this happen frequently enough (in application code, not in databases) that I should be concerned about it?
Deadlocks can arise when two conditions are true: you have multiple threads, and they contend for more than one resource.
Do you write multi-threaded code? You might do this explicitly by starting your own threads, or you might work in a framework where the threads are created out of your sight, so that you're running in more than one thread without seeing it in your code.
An example: the Java Servlet API. You write a servlet or JSP and deploy it to the app server. Several users hit your web site, and hence your servlet. The server will likely have a thread per user.
Now consider what happens if, in servicing the requests, you want to acquire some resources:
if (userIsImportant) {
    getResourceA();        // important users take resource A first ...
}
getResourceB();            // ... then everyone takes resource B ...
if (todayIsThursday) {
    getResourceA();        // ... but on Thursdays A is taken after B
}
// some more code
releaseResourceA();
releaseResourceB();
In the contrived example above, think about what might happen on a Thursday when an important user's request arrives and, more or less simultaneously, an unimportant user's request arrives.
The important user's thread gets resource A and wants B. The less important user's thread gets resource B and wants A. Neither will let go of the resource it already owns ... deadlock.
This can actually happen quite easily if you are writing code that explicitly uses synchronization. Most commonly I see it happen when using databases, and fortunately databases usually have deadlock detection so we can find out what error we made.
Defenses against deadlock (both are sketched below):
1. Acquire resources in a well-defined order. In the above example, if resource A were always obtained before resource B, no deadlock would occur.
2. If possible, use timeouts, so that you don't wait indefinitely for a resource. This lets you detect contention and then apply defense 1.
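
A minimal sketch of both defenses using Python's threading primitives (the helper names are invented for illustration):

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# Defense 1: every thread takes the locks in one global order (here, by id()),
# so no thread can hold one lock while waiting for a lock "earlier" in the order.
def with_both_locks(action):
    first, second = sorted((lock_a, lock_b), key=id)
    with first, second:
        action()

# Defense 2: time out instead of waiting forever, so contention becomes detectable.
def try_with_lock(lock, action, timeout=1.0):
    if lock.acquire(timeout=timeout):
        try:
            action()
            return True
        finally:
            lock.release()
    return False  # caller can back off, log the contention, and retry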
It would be very hard to give an idea of how often it happens in reality (in production code? in development?), and that wouldn't really tell you how much code is vulnerable to it anyway. (Quite often a deadlock will only occur in very specific situations.)
I've seen a few occurrences, although the most recent one I saw was in an Oracle driver (not in the database at all) due to a finalizer running at the same time as another thread trying to grab a connection. Fortunately I found another bug which let me avoid the finalizer running in the first place...
Basically, deadlock is almost always due to trying to acquire one lock (B) whilst holding another one (A) while another thread does exactly the same thing the other way round. If one thread is waiting for B to be released, and the thread holding B is waiting for A to be released, neither is willing to let the other proceed.
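
That shape is easy to reproduce on purpose. A contrived Python sketch of exactly this lock inversion, which will hang if you run it:

import threading
import time

a = threading.Lock()
b = threading.Lock()

def thread_one():
    with a:
        time.sleep(0.1)  # give thread_two time to take b
        with b:          # blocks forever: thread_two holds b and wants a
            pass

def thread_two():
    with b:
        time.sleep(0.1)
        with a:          # blocks forever: thread_one holds a and wants b
            pass

threading.Thread(target=thread_one).start()
threading.Thread(target=thread_two).start()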
Make sure you always acquire locks in the same order (and release them in the reverse order) and you should be able to avoid deadlock in most cases.
There are some odd cases where you don't directly have two locks, but the same basic principle applies. For example, in .NET you might use Control.Invoke from a worker thread to update the UI on the UI thread. Invoke waits until the update has been processed before continuing. Now suppose your background thread holds a lock that the update requires: the worker thread is waiting for the UI thread, but the UI thread can't proceed because the worker thread holds the lock. Deadlock again.
This is the sort of pattern to watch out for. If you make sure you only lock where you need to, lock for as short a period as you can get away with, and document the thread safety and locking policies of all your code, you should be able to avoid deadlock. Like all threading topics, however, it's easier said than done.
If you get a chance, take a look at the first few chapters of Java Concurrency in Practice.
Deadlocks can occur in any concurrent programming situation, so it depends how much concurrency you deal with. Examples of concurrency include multi-process programs, multi-threaded programs, and libraries that introduce threads. UI frameworks and event handling (such as timer events) may be implemented with threads, and web frameworks may spawn threads to handle multiple requests simultaneously. With multicore CPUs you are likely to see more concurrency than before.
If A is waiting for B, and B is waiting for A, the circular wait causes the deadlock. So it also depends on the type of code you write; if you use distributed transactions, for example, you can easily create that type of scenario. And without such coordination you instead risk race conditions, such as transfers between bank accounts losing money.
It all depends on what you are coding. Traditional single-threaded applications that do not use locking? Not really.
Multi-threaded code with multiple locks is what causes deadlocks.
I just finished refactoring code that used seven different locks without proper exception handling. It had numerous deadlock issues.
A common cause of deadlocks is different threads (or processes) acquiring a set of resources in different orders.
E.g. if you have resources A and B, and thread 1 acquires A and then B while thread 2 acquires B and then A, this is a deadlock waiting to happen.
There's a simple solution to this problem: have all your threads always acquire resources in the same order. E.g. if all your threads acquire A and B in that order, you will avoid deadlock.
A deadlock is a situation where two processes are dependent on each other and neither can finish before the other. Therefore, you will likely only have a deadlock in your code if you are running multiple code flows at once.
Developing a multi-threaded application means you need to consider deadlocks. A single-threaded application is unlikely to have deadlocks - unlikely, but not impossible; the obvious example is that you may be using a DB which is subject to deadlocking.
