How to properly write a SIGPROF handler that invokes AsyncGetCallTrace? - asynchronous

I am writing a short and simple profiler (in C), which is intended to print out stack traces for threads in various Java clients at regular intervals. I have to use the undocumented function AsyncGetCallTrace instead of GetStackTrace to minimize intrusion and allow for stack traces regardless of thread state. The source code for the function can be found here: http://download.java.net/openjdk/jdk6/promoted/b20/openjdk-6-src-b20-21_jun_2010.tar.gz
in hotspot/src/share/vm/prims/forte.cpp. I found some man pages documenting JVMTI, signal handling, and timing, as well as a blog with details on how to set up the AsyncGetCallTrace call: http://jeremymanson.blogspot.com/2007/05/profiling-with-jvmtijvmpi-sigprof-and.html
What this blog is missing is the code to actually invoke the function within the signal handler (the author assumes the reader can do this on his/her own). I am asking for help in doing exactly this. I am not sure how and where to create the struct ASGCT_CallTrace (and the internal struct ASGCT_CallFrame), as defined in the aforementioned file forte.cpp. The struct ASGCT_CallTrace is one of the parameters passed to AsyncGetCallTrace, so I do need to create it, but I don't know how to obtain the correct values for its fields: JNIEnv *env_id, jint num_frames, and JVMPI_CallFrame *frames. Furthermore, I do not know what the third parameter passed to AsyncGetCallTrace (void* ucontext) is supposed to be?
The above problem is the main one I am having. However, other issues I am faced with include:
SIGPROF doesn't seem to be raised by the timer exactly at the specified intervals, but rather a bit less frequently. That is, if I set the timer to send a SIGPROF every second (1 sec, 0 usec), then in a 5 second run, I am getting fewer than 5 SIGPROF handler outputs (usually 1-3)
SIGPROF handler outputs do not appear at all during a Thread.sleep in the Java code. So, if a SIGPROF is to be sent every second, and I have Thread.sleep(5000);, I will not get any handler outputs during the execution of that code.
Any help would be appreciated. Additional details (as well as parts of code and sample outputs) will be posted upon request.

I finally got a positive result, but since little discussion was spawned here, my own answer will be brief.
The ASGCT_CallTrace structure (and the underlying ASGCT_CallFrame array) can simply be declared in the signal handler, thus existing only the stack:
ASGCT_CallTrace trace;
JNIEnv *env;
global_VM_pointer->AttachCurrentThread((void **) &env, NULL);
trace.env_id = env;
trace.num_frames = 0;
ASGCT_CallFrame storage[25];
trace.frames = storage;
The following gets the uContext:
ucontext_t uContext;
getcontext(&uContext);
And then the call is just:
AsyncGetCallTrace(&trace, 25, &uContext);
I am sure there are some other nuances that I had to take care of in the process, but I did not really document them. I am not sure I can disclose the full current code I have, which successfully asynchronously requests for and obtains stack traces of any java program at fixed intervals. But if someone is interested in or stuck on the same problem, I am now able to help (I think).
On the other two issues:
[1] If a thread is sleeping and a SIGPROF is generated, the thread handles that signal only after waking up. This is normal, since it is the thread's job to handle the signal.
[2] The timer imperfections do not seem to appear anymore. Perhaps I mis-measured.

Related

Efficiently connecting an asynchronous IMFSourceReader to a synchronous IMFTransform

Given an asynchronous IMFSourceReader connected to a synchronous only IMFTransform.
Then for the IMFSourceReaderCallback::OnReadSample() callback is it a good idea not to call IMFTransform::ProcessInput directly within OnReadSample, but instead push the produced sample onto another queue for another thread to call the transforms ProcessInput on?
Or would I just be replicating identical work source readers typically do internally? Or put another way does work within OnReadSample run the risk of blocking any further decoding work within the source reader that could have otherwise happened more asynchronously?
So I am suggesting something like:
WorkQueue transformInputs;
...
// Called back async
HRESULT OnReadSampleCallback(... IMFSample* sample)
{
// Push sample and return immediately
Push(transformInputs, sample);
}
// Different worker thread awoken for transformInputs queue samples
void OnTransformInputWork()
{
// Transform object is not async capable
transform->TransformInput(0, Pop(transformInputs), 0);
...
}
This is touched on, but not elaborated on here 'Implementing the Callback Interface':
https://learn.microsoft.com/en-us/windows/win32/medfound/using-the-source-reader-in-asynchronous-mode
Or is it completely dependent on whatever the source reader sets up internally and not easily determined?
It is not a good idea to perform a long blocking operation in IMFSourceReaderCallback::OnReadSample. Nothing is going to be fatal or serious but this is not the intended usage.
Taking into consideration your previous question about audio format conversion though, audio sample data conversion is fast enough to happen on such callback.
Also, it is not clear or documented (depends on actual implementation), ProcessInput is often instant and only references input data. ProcessOutput would be computationally expensive in this case. If you don't do ProcessOutput right there in the same callback you might run into situation where MFT is no longer accepting input, and so you'd have to implement a queue anyway.
With all this in mind you would just do the processing in the callback neglecting performance impact assuming your processing is not too heavy, or otherwise you would just start doing the queue otherwise.

PyQt signals: will a fired signal still do its job after a disconnect()?

I'm dealing with an application where many signals get fired after which a reconnect follows. I'll explain in detail how the application works, and also where my confusion starts.
1. Reconnecting a signal
In my application, I reconnect signals frequently. I will use the following static function for that, taken from the answer of #ekhumoro (and slightly modified) from this post: PyQt Widget connect() and disconnect()
def reconnect(signal, newhandler):
while True:
try:
signal.disconnect()
except TypeError:
break
if newhandler is not None:
signal.connect(newhandler)
2. The application
Imagine the function emitterFunc(self) looping through a list of objects. Upon each iteration, the function connects mySignal to an object, fires the signal and then disconnects mySignal again at the start of the next iteration step. The fired signal also carries some payload, for example an object Foo().
EDIT:
The design shown above is simplified a lot. In the final design, the signal emitter and the receiving slot might operate in different threads.
For reasons that would lead us too far astray, I cannot just connect all the objects at once, emit a signal, and finally disconnect them all. I have to cycle through them one-by-one, doing a connect-emit-disconnect procedure.
Again for reasons that would lead us too far, I cannot just call these slots directly.
3. A mental image of the Signal-Slot mechanism
Over time, I have built up a mental image of how the Signal-Slot mechanism works. I imagine a Signal-Slot engine absorbing all fired signals and putting them in a Queue. Each signal awaits its turn. When time is ready, the engine passes the given signal to the appropriate handler. To do that correctly, the engine has some 'bookkeeping' work to ensure each signal ends up in the right slot.
4. Behavior of the Signal-Slot engine
Imagine we're at the nth iteration step. We connect self.mySignal to object_n. Then we fire the signal with its payload. Almost immediately after doing that, we break the connection and establish a new connection to object_n+1. At the moment we break the connection, the fired signal probably didn't do its job yet. I can imagine three possible behaviors of the Signal-Slot engine:
[OPTION 1] The engine notices that the connection is broken, and discards sig_n from its Queue.
[OPTION 2] The engine notices that the connection is re-established to another handler, and sends sig_n to the handler of object_n+1 (as soon as it gets to the front of the Queue).
[OPTION 3] The engine doesn't change anything for sig_n. When fired, it was intended for the handler of object_n, and that's where it will end up.
5. My questions
My first question is pretty obvious by now. What is the correct Signal-Slot engine behavior? I hope it is the third option.
As a second question, I'd like to know to what extent the given mental image is correct. For example, can I rely on the signals getting out of the Queue in order? This question is less important - it's certainly not vital to my application.
The third question has to do with time efficiency. Is reconnecting to another handler time-consuming? Once I know the answer to the first question, I will proceed building the application and I could measure the reconnection time myself. So this question is not so vital. But if you know the answer anyway, please share :-)
I would start with your second question, to say that your mental image is partially correct, because a queue is involved, but not always. When a signal is emitted, there are three possible ways of calling the connected slot(s) and two of them use the event queue (a QMetaCallEvent is instantiated on the fly and posted with QCoreApplication's method postEvent, where the event target is the slot holder, or the signal receiver, if you prefer). The third case is a direct call, so emitting the signal is just like calling the slot, and nothing get queued.
Now to the first question: in any case, when the signal is emitted, a list of connections (belonging to the signal emitter) is traversed and slots are called one after the other using one of the three methods hinted above. Whenever a connection is made, or broken, the list will be updated, but this necessarily happens before or after the signal is emitted. In short: there is very little chances to succeed in blocking a call to a connected slot after the signal has been emitted, at least not breaking the connection with disconnect(). So I would mark the [OPTION 3] as correct.
If you want to dig further, start from the ConnectionType enum documentation where the three fundamental types of connection (direct, queued and blocking-queued) are well explained. A connection type can be specified as a fifth argument to QObject's method connect, but, as you'll learn from the above linked docs, very often is Qt itself to choose the connection type that best suits the situation. Spoiler: threads are involved :)
About the third question: I have no benchmark tests at hand to show, so I will give a so called primarily opinion based answer, the kind of answer that starts with IMHO. I think that the signal/slot realm is one of those where the keep-it-simple rules do rule, and your reconnect pattern seems to make things much complicated than they need to be. As I hinted above, when a connection is made, a connection object is appended to a list. When the signal is emitted, all the connected slots will be called somehow, one after the other. So, instead of disconnect/reconnect/emit at each cycle in your loop, why don't just connect all items first, then emit the signal, then disconnect them all?
I hope my (long and maybe tldr) answer helped. Good read.

What should be the expected behavior when I call MPI_IRecv/MPI_Recv when there're no messages

I'm new to MPI so bear with me.
I could find any documents on what the expected behavior is in this case.
Let's say I have proc #1 calling MPI_IRecv from ANY_SOURCE but no one has ever sent anything to proc #1, would I receive an empty buffer or would I get an error?
Since MPI_IRecv is the asynchronous version, basically nothing will happen at first.
If you e.g. MPI_Wait on the resulting request the same thing will happen as if you called MPI_Recv directly: Your program will (usually) block until a fitting message arrives.
If you do not ever send that message, your program simply starves.
Note: The "usually" refers to all the possible cases in which an error is created due to some other related or unrelated issue.

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get piled under obsolete observations
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work but it may get complicated as dataProcessor may need to be blocked and re-activated; and communicate its status to dataSource, leading to a two way communication problem. This problem may boil down to a blocking queue in the consumer-producer problem but I am not sure..
The second idea is to have dataProcessor taking care of message sorting. In this architecture, dataSource will simply post updates in dataProcessor's queue. dataProcessor will use Scanto fetch the latest data available in his queue. This may be the way to go. However, I am not sure if in the current design of MailboxProcessorit is possible to clear a queue of messages, deleting the older obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, #Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive scan version of the mailbox processor, I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace ConcurrentQueue by ConcurrentStack (plus add some bounded capacity logic) and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process a message that is the latest at the moment when a worker is ready, then it all boils down to an agent with a stack instead of a queue (with some bounded capacity logic) plus a BlockingQueue of workers. Dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done the worker enqueues itself to the queue when becomes ready (e.g. before let! msg = inbox.Receive()). Dispatcher consumer thread then blocks until any worker is ready, while producer thread keeps the bounded stack updated. (bounded stack could be done with an array + offset + size inside a lock, below is too complex one)
Details
MailBoxProcessor is designed to have only one consumer. This is even commented in the source code of MBP here (search for the word 'DRAGONS' :) )
If you post your data to MBP then only one thread could take it from internal queue or stack.
In you particular use case I would use ConcurrentStack directly or better wrapped into BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has BoundedCapacity property that allows you to limit the size of a collection. It throws on Add, but you could catch it or use TryAdd. If A is a main stack and B is a standby, then TryAdd to A, on false Add to B and swap the two with Interlocked.Exchange, then process needed messages in A, clear it, make a new standby - or use three stacks if processing A could be longer than B could become full again; in this way you do not block and do not lose any messages, but could discard unneeded ones is a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on an arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data. You may sacrifice one thread to be blocking and dispatching your messages indefinitely
each processing agent has its own BCCS or implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store processing agents in a circular buffer to always dispatch a message to least recently used processor.
Something like this:
(data stream produces 'T)
|
[dispatcher's BCSC]
|
(a dispatcher thread consumes 'T and pushes to processors, manages capacity of BCCS and LRU queue)
| |
[processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
| |
(process) (process)
Instead of ConcurrentStack, you may want to read about heap data structure. If you need your latest messages by some property of messages, e.g. timestamp, rather than by the order in which they arrive to the stack (e.g. if there could be delays in transit and arrival order <> creation order), you can get the latest message by using heap.
If you still need Agents semantics/API, you could read several sources in addition to Dave's links, and somehow adopt implementation to multiple concurrent consumers:
An interesting article by Zach Bray on efficient Actors implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true by a line async { execute true } |> Async.Start or similar, because otherwise producing thread will be consuming thread - not good for a single fast producer. However, for a dispatcher like described above this is exactly what needed.
FSharp.Actor (aka Fakka) development branch and FSharp MPB source code (first link above) here could be very useful for implementation details. FSharp.Actors library has been in a freeze for several months but there is some activity in dev branch.
Should not miss discussion about Fakka in Google Groups in this context.
I have a somewhat similar use case and for the last two days I have researched everything I could find on the F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, of which half were born during writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
async { let! newMsg = inbox.TryReceive 0
match newMsg with
| None -> oldMsg
| Some newMsg -> return! readLatestLoop newMsg }
let readLatest() =
async { let! msg = inbox.Receive()
return! readLatestLoop msg }
When faced with the same problem I architected a more sophisticated and efficient solution I called cancellable streaming and described in in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superceded. This significantly improves concurrency if significant processing is being done.

What is call out table in unix?

Can anybody tell me what a 'call out table' is in Unix? Maurice J. Bach gives an explanation in his book Design of the UNIX Operating System, but I'm having difficulty in understanding the examples, especially the one explaining the reason of negative time-out fields. Why are software interrupts used there?
Thanks!
Interrupts stop the current code and start the execution of a high-priority handler; while this handler runs, nothing else can get the CPU. So if you need to do something complex, your interrupt handler would hang the whole system.
The solution: Fill a data structure with all the necessary data and then store this data structure with a pointer to the handler in the call out table. Some service (usually the clock handler) will eventually visit the table and execute the entries one by one in a standard context (i.e. one which doesn't block the process switching).
In System V unix, the kernel or device drivers could schedule some function to be run (or "called out") by the kernel at a later time. The kernel clock handler was in charge of making sure such registered call outs were executed. The call out table was the kernel data structure in which such registered "call outs" were stored.
I don't know to what end they were generally used.

Resources