Critical section negative lock count - deadlock

I am debugging a deadlock issue and call stack shows that threads are waiting on some events.
Code is using critical section as synchronization primitive I think there is some issue here.
Also the debugger is pointing to a critical section that is owned by some other thread,but lock count is -2.
As per my understanding lock count>0 means that critical section is locked by one or more threads.
So is there any possibility that I am looking at right critical section which could be the culprit in deadlock.
In what scenarios can a critical section have negative lock count?

Beware: since Windows Server 2003 (for client OS this is Vista and newer) the meaning of LockCount has changed and -2 is a completely normal value, commonly seen when a thread has entered a critical section without waiting and no other thread is waiting for the CS. See Displaying a Critical Section:
In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:
The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked.
The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken.
The remaining bits are the ones-complement of the number of threads waiting for the lock.

I am assuming that you are talking about CCriticalSection class in MFC. I think you are looking at the right critical section. I have found that the critical section's lock count can go negative if the number of calls to Lock() function is less than the number of Unlock() calls. I found that this generally happens in the following type of code:
void f()
{
CSingleLock lock(&m_synchronizer, TRUE);
//Some logic here
m_synchronizer.Unlock();
}
At the first glance this code looks perfectly safe. However, note that I am using CCriticalSection's Unlock() method directly instead of CSingleLock's Unlock() method. Now what happens is that when the function exits, CSingleLock in its destructor calls Unlock() of the critical section again and its lock count becomes negative. After this the application will be in a bad shape and strange things start to happen. If you are using MFC critical sections then do check for this type of problems.

Related

How to avoid QSS memory leak with :image selector?

I few days search the source on the memory leak in my software and at least found it.
So steps:
I create the GUI application, add image to the .qrc, create form in Qt Designer, add QPushButton there and in the styleSheet property write
#closeButton{ image: url(:/system/images/White/Close.png); }
(the button named as "closeButton")
Without style sheet that I add the program works fine, with style sheet - I receive memory leak.
So how to avoid memory leak in this case?
Objects that survive till the process termination aren't necessarily memory leaks, and the tool won't be able to tell you which ones are memory leaks and which aren't. Memory leaks are usually only the allocations that are made multiple times from the same program location, and never get freed. Even then, it may not always be the case. Leak detection requires a purpose-made test harness that repeats a series of operations that are supposed not to leave behind memory allocated at any given program location more than once. If you then notice that, with increasing number of operations, the number of memory blocks left behind increases, you likely have a real leak. Ideally, the test harness should take snapshots of allocated memory blocks after each "operation cycle", and flag the program locations that consistently leave stuff behind. The library should be able to capture a stack trace to give you the program location where the allocation was made. Otherwise it's useless in practice.
I'm very suspicious of code that deallocates all memory before process termination: usually it's just wasted time, and it prolongs system shutdown and is just bad UX. When the user hits the "Exit" button, make sure that data is safe (e.g. close sqlite files, save open documents - maybe just as "work in progress" that will be brought back the next time the application is used), and then call exit(0).
In general, leak detection takes a bit more than just using a library that gives you a list of memory blocks allocated at exit. The library is a tool, that you, a thinking, reasoning human developer must apply to the problem :) Just as a hammer won't be useful by banging it all over the place (unless you got lots of nails to hammer!), so won't be a "leak detector" library all by itself.

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get piled under obsolete observations
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work but it may get complicated as dataProcessor may need to be blocked and re-activated; and communicate its status to dataSource, leading to a two way communication problem. This problem may boil down to a blocking queue in the consumer-producer problem but I am not sure..
The second idea is to have dataProcessor taking care of message sorting. In this architecture, dataSource will simply post updates in dataProcessor's queue. dataProcessor will use Scanto fetch the latest data available in his queue. This may be the way to go. However, I am not sure if in the current design of MailboxProcessorit is possible to clear a queue of messages, deleting the older obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, #Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive scan version of the mailbox processor, I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace ConcurrentQueue by ConcurrentStack (plus add some bounded capacity logic) and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process a message that is the latest at the moment when a worker is ready, then it all boils down to an agent with a stack instead of a queue (with some bounded capacity logic) plus a BlockingQueue of workers. Dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done the worker enqueues itself to the queue when becomes ready (e.g. before let! msg = inbox.Receive()). Dispatcher consumer thread then blocks until any worker is ready, while producer thread keeps the bounded stack updated. (bounded stack could be done with an array + offset + size inside a lock, below is too complex one)
Details
MailBoxProcessor is designed to have only one consumer. This is even commented in the source code of MBP here (search for the word 'DRAGONS' :) )
If you post your data to MBP then only one thread could take it from internal queue or stack.
In you particular use case I would use ConcurrentStack directly or better wrapped into BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has BoundedCapacity property that allows you to limit the size of a collection. It throws on Add, but you could catch it or use TryAdd. If A is a main stack and B is a standby, then TryAdd to A, on false Add to B and swap the two with Interlocked.Exchange, then process needed messages in A, clear it, make a new standby - or use three stacks if processing A could be longer than B could become full again; in this way you do not block and do not lose any messages, but could discard unneeded ones is a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on an arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data. You may sacrifice one thread to be blocking and dispatching your messages indefinitely
each processing agent has its own BCCS or implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store processing agents in a circular buffer to always dispatch a message to least recently used processor.
Something like this:
(data stream produces 'T)
|
[dispatcher's BCSC]
|
(a dispatcher thread consumes 'T and pushes to processors, manages capacity of BCCS and LRU queue)
| |
[processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
| |
(process) (process)
Instead of ConcurrentStack, you may want to read about heap data structure. If you need your latest messages by some property of messages, e.g. timestamp, rather than by the order in which they arrive to the stack (e.g. if there could be delays in transit and arrival order <> creation order), you can get the latest message by using heap.
If you still need Agents semantics/API, you could read several sources in addition to Dave's links, and somehow adopt implementation to multiple concurrent consumers:
An interesting article by Zach Bray on efficient Actors implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true by a line async { execute true } |> Async.Start or similar, because otherwise producing thread will be consuming thread - not good for a single fast producer. However, for a dispatcher like described above this is exactly what needed.
FSharp.Actor (aka Fakka) development branch and FSharp MPB source code (first link above) here could be very useful for implementation details. FSharp.Actors library has been in a freeze for several months but there is some activity in dev branch.
Should not miss discussion about Fakka in Google Groups in this context.
I have a somewhat similar use case and for the last two days I have researched everything I could find on the F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, of which half were born during writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
async { let! newMsg = inbox.TryReceive 0
match newMsg with
| None -> oldMsg
| Some newMsg -> return! readLatestLoop newMsg }
let readLatest() =
async { let! msg = inbox.Receive()
return! readLatestLoop msg }
When faced with the same problem I architected a more sophisticated and efficient solution I called cancellable streaming and described in in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superceded. This significantly improves concurrency if significant processing is being done.

How to control the number of threads when executing an Asynchronous Activity in WF 4

I am creating a workflow in WF 4, where I have a ParallelForeach activity that iterates over a collection of items. For each item in the collection, I execute a custom Asynchronous activity to processing multiple items in parallel.
The above solution works for me, but I am concerned about the number of threads used since each Asynchronous activity instance is executed on its own thread. Is there a way to configure/control the number of threads that get launched when executing the parallelForeach activity in the above described mechanism?
since each Asynchronous activity instance is getting executed on its own thread. Who says? Certainly not the docs.
ParallelForEach enumerates its values and schedules the Body for every value it enumerates on. It only schedules the Body. How the body executes depends on whether the Body goes idle.
If the Body does not go idle, it executes in a reverse order because the scheduled activities are handled as a stack, the last scheduled activity executes first.
For example, if you have a collection of {1,2,3,4}in ParallelForEach and use a WriteLine as the body to write the value out. You have 4, 3, 2, 1 printed out in the console. This is because WriteLine does not go idle so after 4 WriteLine activities got scheduled, they executed using a stack behavior (first in last out).
The Parallelism of execution occurs only when an Activity creates a bookmark and goes idle. Even then, two activities aren't actually executing at the same time--one or more have just stopped executing, allowing others to run in order. Understandably confusing, given the name, but that's it.
In any event, when you're relying on the framework to parallelize for you, don't worry about how many threads they're using. They probably have everything under control. Until you know they don't.
Will is correct, ParallelForEach does not require a new thread for each branch. If you are doing blocking I/O in code that should occur in an AsyncCodeActivity so that you aren't unecessarily blocking. If you want CPU-bound work to run in parallel to other activities you will either need to wrap it in an AsyncCodeActivity or use InvokeMethod { RunAsynchronously = true} in which case the framework will take care of running the work on a background thread.
The SynchronizationContext extensibility point is intended for cases where you have a particular existing threading model that you need WF to integrate with. Prime examples of this include ASP.NET's threading environment, and Windows Presentation Foundation/WinForms (e.g. if you wanted a activity to work correctly).

Qt QSemaphore's release() does not immediately notify waiters?

I've written a Qt console application to try out QSemaphores and noticed some strange behavior. Consider a semaphore with 1 resource and two threads getting and releasing a single resource. Pseudocode:
QSemaphore sem(1); // init with 1 resource available
thread1()
{
while(1)
{
if ( !sem.tryAquire(1 resource, 1 second timeout) )
{
print "thread1 couldn't get a resource";
}
else
{
sem.release(1);
}
}
}
// basically the same thing
thread2()
{
while(1)
{
if ( !sem.tryAquire(1 resource, 1 second timeout) )
{
print "thread2 couldn't get a resource";
}
else
{
sem.release(1);
}
}
}
Seems straightforward, but the threads will often fail to get a resource. A way to fix this is to put the thread to sleep for a bit after sem.release(1). What this tells me is that the release() member does not allow other threads waiting in tryAquire() access to the semaphore before the current thread loops around to the top of while(1) and grabs the resource again.
This surprises me because similar testing with QMutex showed proper behavior... i.e. another thread hanging out in QMutex::tryLock(timeout) gets notified properly when QMutex::unlock() is called.
Any ideas?
I'm not able to fully test this out or find all of the suppporting links at the moment, but here are a few observations...
First, the documentation for QSemaphore.tryAcquire indicates that the timeout value is in milliseconds, not seconds. So your threads are only waiting 1 millisecond for the resource to become free.
Secondly, I recall reading somewhere (I unfortunately can't remember where) a discussion about what happens when multiple threads are trying to acquire the same resource simultaneously. Although the behavior may vary by OS and situation, it seemed that the typical result is that it is a free-for-all with no one thread being given any more priority than another. As such, a thread waiting to acquire a resource would have just as much chance of getting it as would a thread that had just released it and is attempting to immediately reacquire it. I'm unsure if the priority setting of a thread would affect this.
So, why might you get different results for a QSemaphore versus a QMutex? Well, I think a semaphore may be a more complicated system resource that would take more time to acquire and release than a mutex. I did some simple timing recently for mutexes and found that on average it was taking around 15-25 microseconds to lock or unlock one. In the 1 millisecond your threads are waiting, this would be at least 20 cycles of locking and unlocking, and the odds of the same thread always reacquiring the lock in that time are small. The waiting thread is likely to get at least one bite at the apple in the time that it is waiting, so you won't likely see any acquisition failures when using mutexes in your example.
If, however, releasing and acquiring a semaphore takes much longer (I haven't timed them but I'm guessing they might), then it's more likely that you could just by chance get a situation where one thread is able to keep reacquiring the resource repeatedly until the wait condition for the waiting thread runs out.

How to properly write a SIGPROF handler that invokes AsyncGetCallTrace?

I am writing a short and simple profiler (in C), which is intended to print out stack traces for threads in various Java clients at regular intervals. I have to use the undocumented function AsyncGetCallTrace instead of GetStackTrace to minimize intrusion and allow for stack traces regardless of thread state. The source code for the function can be found here: http://download.java.net/openjdk/jdk6/promoted/b20/openjdk-6-src-b20-21_jun_2010.tar.gz
in hotspot/src/share/vm/prims/forte.cpp. I found some man pages documenting JVMTI, signal handling, and timing, as well as a blog with details on how to set up the AsyncGetCallTrace call: http://jeremymanson.blogspot.com/2007/05/profiling-with-jvmtijvmpi-sigprof-and.html
What this blog is missing is the code to actually invoke the function within the signal handler (the author assumes the reader can do this on his/her own). I am asking for help in doing exactly this. I am not sure how and where to create the struct ASGCT_CallTrace (and the internal struct ASGCT_CallFrame), as defined in the aforementioned file forte.cpp. The struct ASGCT_CallTrace is one of the parameters passed to AsyncGetCallTrace, so I do need to create it, but I don't know how to obtain the correct values for its fields: JNIEnv *env_id, jint num_frames, and JVMPI_CallFrame *frames. Furthermore, I do not know what the third parameter passed to AsyncGetCallTrace (void* ucontext) is supposed to be?
The above problem is the main one I am having. However, other issues I am faced with include:
SIGPROF doesn't seem to be raised by the timer exactly at the specified intervals, but rather a bit less frequently. That is, if I set the timer to send a SIGPROF every second (1 sec, 0 usec), then in a 5 second run, I am getting fewer than 5 SIGPROF handler outputs (usually 1-3)
SIGPROF handler outputs do not appear at all during a Thread.sleep in the Java code. So, if a SIGPROF is to be sent every second, and I have Thread.sleep(5000);, I will not get any handler outputs during the execution of that code.
Any help would be appreciated. Additional details (as well as parts of code and sample outputs) will be posted upon request.
I finally got a positive result, but since little discussion was spawned here, my own answer will be brief.
The ASGCT_CallTrace structure (and the underlying ASGCT_CallFrame array) can simply be declared in the signal handler, thus existing only the stack:
ASGCT_CallTrace trace;
JNIEnv *env;
global_VM_pointer->AttachCurrentThread((void **) &env, NULL);
trace.env_id = env;
trace.num_frames = 0;
ASGCT_CallFrame storage[25];
trace.frames = storage;
The following gets the uContext:
ucontext_t uContext;
getcontext(&uContext);
And then the call is just:
AsyncGetCallTrace(&trace, 25, &uContext);
I am sure there are some other nuances that I had to take care of in the process, but I did not really document them. I am not sure I can disclose the full current code I have, which successfully asynchronously requests for and obtains stack traces of any java program at fixed intervals. But if someone is interested in or stuck on the same problem, I am now able to help (I think).
On the other two issues:
[1] If a thread is sleeping and a SIGPROF is generated, the thread handles that signal only after waking up. This is normal, since it is the thread's job to handle the signal.
[2] The timer imperfections do not seem to appear anymore. Perhaps I mis-measured.

Resources